PySpark for Data Science - V: ML Pipelines

Section 1: Recap Decision Trees

  1. Reading Decision Tree
  2. Download Resources
  3. How Does a Decision Tree Work?

Section 2: Build Decision Trees in PySpark

  1. Import required libraries and initialize SparkSession
  2. Load the dataset
  3. Prepare the data
  4. Building the DecisionTreeClassifier model
  5. Evaluating the model on test data
  6. Feature Importance
  7. Improve the model (optional)

Section 3: Tuning the Tree with Pipelines

  1. Creating a Pipeline & Hyperparameter Tuning

Section 4: Self Assessment

  1. Random Forest Approach
  2. Gradient Boosting
  3. Compare results across the 4 approaches

Section 5: XGBoost model using PySpark

  1. The problem with XGBoost for PySpark
  2. Install XGBoost in PySpark
  3. Run XGBoost in PySpark