PySpark for Data Science - II: Statistics for Big Data

Section 1: Introduction to Statistics for Big Data

  1. Introduction to Statistics for Big Data
  2. Download Resources

Section 2: Mean

  1. Calculate the mean of a list
  2. Mean for a PySpark DataFrame

Section 3: Median

  1. How to calculate the Median of a list?
  2. Calculate Median of a PySpark DataFrame column
  3. Using PySpark Window function

Section 4: Mode

  1. How to calculate the Mode of a list?
  2. How to calculate Mode on a PySpark DataFrame?

Section 5: Deciles and Quartiles

  1. How to calculate deciles and quartiles?
  2. Using approxQuantile on multiple columns

Section 6: Variance

  1. Calculating variance using PySpark RDDs
  2. On PySpark DataFrame columns

Section 7: Standard Deviation

  1. Using the describe() function
  2. Using the agg() based methods
  3. Using the selectExpr() function with SQL expressions

Section 8: Correlation

  1. Using DataFrame API
  2. Using MLlib
  3. Correlation matrix and heat map

Section 9: T-Test

  1. T-test

Section 10: F-Test

  1. F-Test

Section 11: Chi-Square Test

  1. Chi-Square Test