PySpark for Data Science - V: ML Pipelines

Section 1: Recap Decision Trees

Reading a Decision Tree
Download Resources
How a Decision Tree Works

Section 2: Build Decision Trees in PySpark

Import required libraries and initialize SparkSession
Load the dataset
Prepare the data
Building the DecisionTreeClassifier model
Evaluating the model on test data
Feature Importance
Improve the model (optional)

Section 3: Tuning the Tree with Pipelines

Creating a Pipeline & Hyperparameter Tuning

Section 4: Self Assessment

Random Forest Approach
Gradient Boosting
Compare results across the four approaches

Section 5: XGBoost model using PySpark

The problem with XGBoost for PySpark
Install XGBoost in PySpark
Run XGBoost in PySpark
