PySpark for Data Science - III : Data Cleaning and Analysis Methods

Section 1: Introduction to Data Preprocessing

Data Preprocessing Essence
Download Resources
Automate Variable Type Identification
Self Assessment - 1

Section 2: Outlier Detection and Treatment

Outlier Treatment Approaches
How to Detect and Treat Outliers Using the IQR Method?
How to Detect and Treat Outliers Using the z-score?
Identifying and Removing Duplicates
Self Assessment - 2

Section 3: Missing Value Imputation

Missing Value Treatment Approaches
Approach 1: Detect and drop the row with missing values
Approach 2: Check conditions and drop an entire column
Approach 3: Impute the missing data appropriately
Self Assessment - 3

Section 4: Feature Encoding

StringIndexer
One-Hot Encoding

Section 5: Feature Scaling

Feature Min-Max Scaling
Feature Standardization

Section 6: Feature Extraction / Dimensionality Reduction

Variance Inflation Factor (VIF)
PCA (Principal Component Analysis)
Final Self Assessment
