PySpark for Data Science - II: Statistics for Big Data
Section 1: Introduction to Statistics for Big Data
- Introduction to Statistics for Big Data
- Download Resources
Section 2: Mean
- Calculate the mean of a list
- Mean for a PySpark DataFrame
Section 3: Median
- How to calculate the Median of a list?
- Calculate Median of a PySpark DataFrame column
- Using PySpark Window function
Section 4: Mode
- How to calculate the Mode of a list
- How to calculate Mode on a PySpark DataFrame?
Section 5: Deciles and Quartiles
- How to calculate deciles and quartiles?
- Using approxQuantile on multiple columns
Section 6: Variance
- Calculating using PySpark RDDs
- On PySpark DataFrame columns
Section 7: Standard Deviation
- Using the describe() function
- Using the agg() based methods
- Using the selectExpr() function with SQL expressions
Section 8: Correlation
- Using DataFrame API
- Using MLlib
- Correlation matrix and heat map
Section 9: T-Test
- T-test
Section 10: F-Test
- F-Test
Section 11: Chi-Square Test
- Chi-Square Test