PySpark for Data Science - III: Data Cleaning and Analysis Methods


Section 1: Introduction to Data Preprocessing

  1. Data Preprocessing Essence
  2. Download Resources
  3. Automate Variable Type Identification
  4. Self Assessment - 1

Section 2: Outlier Detection and Treatment

  1. Outlier Treatment Approaches
  2. How to Detect and Treat Outliers Using the IQR Method?
  3. How to Detect and Treat Outliers Using the Z-Score?
  4. Identifying and Removing Duplicates
  5. Self Assessment - 2

Section 3: Missing Value Imputation

  1. Missing Value Treatment Approaches
  2. Approach 1: Detect and drop rows with missing values
  3. Approach 2: Check conditions and drop an entire column
  4. Approach 3: Impute the missing data appropriately
  5. Self Assessment - 3

Section 4: Feature Encoding

  1. StringIndexer
  2. One-Hot Encoding

Section 5: Feature Scaling

  1. Feature Min-Max scaling
  2. Feature Standardization

Section 6: Feature Extraction / Dimensionality Reduction

  1. Variance Inflation Factor (VIF)
  2. PCA (Principal Component Analysis)
  3. Final Self Assessment