PySpark for Data Science - III: Data Cleaning and Analysis Methods
Section 1: Introduction to Data Preprocessing
- Data Preprocessing Essence
- Download Resources
- Automate Variable Type Identification
- Self Assessment - 1
Section 2: Outlier Detection and Treatment
- Outlier Treatment Approaches
- How to Detect and Treat Outliers Using the IQR Method?
- How to Detect and Treat Outliers Using the Z-Score?
- Identifying and Removing Duplicates
- Self Assessment - 2
Section 3: Missing Value Imputation
- Missing Value Treatment Approaches
- Approach 1: Detect and drop the row with missing values
- Approach 2: Check conditions and drop an entire column
- Approach 3: Impute the missing data appropriately
- Self Assessment - 3
Section 4: Feature Encoding
- StringIndexer
- One-Hot Encoding
Section 5: Feature Scaling
- Feature Min-Max Scaling
- Feature Standardization
- Variance Inflation Factor (VIF)
- PCA (Principal Component Analysis)
- Final Self Assessment