This lesson is still being designed and assembled (Pre-Alpha version)

Exploring and Modeling High-Dimensional Data: Glossary

Key Points

Exploring high dimensional data
  • Data can be anything, as long as you can represent it in a computer.

  • A dimension is a feature in a dataset, i.e. a column. An index is NOT a dimension.
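
The column/index distinction can be sketched with a small pandas table. The table, its column names, and the `house_id` index below are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy table: three houses described by two features.
houses = pd.DataFrame(
    {"lot_area": [8450, 9600, 11250], "year_built": [2003, 1976, 2001]},
    index=pd.Index([1, 2, 3], name="house_id"),
)

# shape reports (rows, columns); the index labels rows and is not a column.
n_observations, n_dimensions = houses.shape
print(n_observations, n_dimensions)
print(list(houses.columns))
```

Here the data has two dimensions (lot_area and year_built), even though each row also carries an index label.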

The Ames housing dataset
Predictive vs. explanatory regression
  • Linear regression models can be used to predict a target variable and/or to reveal relationships between variables

  • Linear models are most effective when applied to linear relationships. Data transformations can be used to bring a relationship closer to linear before it is modelled.

  • Train/test splits are used to assess under/overfitting in a model

  • Different model evaluation metrics provide different perspectives on model error. Some metrics, such as R-squared, are not as relevant for explanatory models.
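
As a minimal sketch of the train/test idea, the example below fits a line to synthetic data with NumPy and compares RMSE and R-squared on the two splits. The data, the 25% hold-out size, and the helper names (`design`, `rmse`, `r_squared`) are assumptions for illustration, not the lesson's own code; a library such as scikit-learn would normally handle the split and the fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x + 2 + noise.
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=n)

# Hold out 25% of the rows as a test set.
order = rng.permutation(n)
test_idx, train_idx = order[:50], order[50:]


def design(values):
    # Column of ones for the intercept, then the predictor.
    return np.column_stack([np.ones_like(values), values])


# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(design(x[train_idx]), y[train_idx], rcond=None)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


scores = {}
for name, idx in [("train", train_idx), ("test", test_idx)]:
    pred = design(x[idx]) @ coef
    scores[name] = {"rmse": rmse(y[idx], pred), "r2": r_squared(y[idx], pred)}
    print(name, scores[name])
```

A test error much larger than the training error would suggest overfitting; large errors on both would suggest underfitting.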

Model validity - relevant predictors
  • All models are wrong, but some are useful.

  • Before reading into a model’s estimated coefficients, modelers must take care to account for essential predictor variables

  • Models that do not account for essential predictor variables can produce distorted pictures of reality due to omitted variable bias and confounding effects.
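
Omitted variable bias can be illustrated with a small simulation. In this sketch, which uses synthetic data and a hypothetical helper `ols`, the true effect of x on y is 1.0, but a confounder z drives both x and y. Leaving z out of the model inflates the estimated coefficient on x.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

z = rng.normal(size=n)                       # confounder
x = z + rng.normal(size=n)                   # predictor correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is 1.0


def ols(columns, target):
    # Ordinary least squares with an intercept; returns [intercept, slopes...].
    X = np.column_stack([np.ones(len(target))] + columns)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


full = ols([x, z], y)      # accounts for the confounder
reduced = ols([x], y)      # omits the confounder

print("coefficient on x with z included:", full[1])
print("coefficient on x with z omitted: ", reduced[1])
```

With these settings the reduced model attributes part of z's effect to x, so its coefficient lands near 2 rather than 1.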

Model validity - regression assumptions
  • All models are wrong, but some are useful.

  • Before reading into a model’s estimated coefficients, modelers must take care to check for evidence of overfitting and assess the 5 assumptions of linear regression.

  • One-hot encoding, while necessary, can often produce very sparse binary predictors that carry little information. Predictors with very little variability should be removed prior to model fitting.
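
One way to screen out sparse one-hot predictors is sketched below. The `roof_style` column is made up for illustration, and the 5% cutoff is an assumed value; a suitable threshold depends on the dataset and sample size.

```python
import pandas as pd

# Hypothetical categorical column; "Stone" appears only once in 100 rows.
roof = pd.Series(["Gable"] * 70 + ["Hip"] * 29 + ["Stone"] * 1, name="roof_style")

# One binary (0/1) column per category.
dummies = pd.get_dummies(roof, dtype=int)

# Share of rows in which each binary predictor equals 1.
share = dummies.mean()

# Assumed cutoff: drop predictors that equal 1 in fewer than 5% of rows
# or, symmetrically, in more than 95% of rows.
min_share = 0.05
keep = share[(share >= min_share) & (share <= 1 - min_share)].index
filtered = dummies[keep]

print(share.to_dict())
print(list(filtered.columns))
```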

Model interpretation and hypothesis testing
  • All models are wrong, but some are useful.

Feature selection with PCA
Unpacking PCA
  • PCA transforms your original data by projecting it onto new axes.

  • Principal components are orthogonal vectors through your data; each successive component explains as much of the remaining variability as possible.
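
The two points above can be sketched with NumPy's singular value decomposition. The data here is synthetic, and this hand-rolled version is only an illustration under those assumptions; a library implementation such as scikit-learn's PCA would typically be used in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D data stretched along one direction, then rotated
# (an assumption for illustration).
latent = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
angle = 0.6
rotation = np.array(
    [[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]]
)
X = latent @ rotation.T

# PCA works on centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                                  # rows are the new axes
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the data onto the new axes.
projected = X_centered @ components.T

print("explained variance ratio:", explained_ratio)
print("components orthonormal:", np.allclose(components @ components.T, np.eye(2)))
```

The first component captures most of the variability, and the variance of each projected column matches the corresponding entry of explained_variance.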

Regularization methods - lasso, ridge, and elastic net
Exploring additional datasets
Clustering high dimensional data
  • TODO

Data visualization techniques: t-SNE and PaCMAP
  • TODO

Glossary

FIXME