This lesson is still being designed and assembled (Pre-Alpha version)

This lesson is part of The Carpentries Incubator, a place to share and use each other's Carpentries-style lessons. This lesson has not been reviewed by and is not endorsed by The Carpentries.

Exploring and Modeling High-Dimensional Data: Glossary

Key Points

Exploring high dimensional data	Data can be anything, as long as you can represent it in a computer. A dimension is a feature in a dataset, i.e. a column. An index is NOT a dimension.
The Ames housing dataset
Predictive vs. explanatory regression	Linear regression models can be used to predict a target variable and/or to reveal relationships between variables. Linear models are most effective when applied to linear relationships. Data transformation techniques can be used to help ensure that only linear relationships are modelled. Train/test splits are used to assess under/overfitting in a model. Different model evaluation metrics provide different perspectives of model error. Some error measurements, such as R-squared, are not as relevant for explanatory models.
Model validity - relevant predictors	All models are wrong, but some are useful. Before reading into a model’s estimated coefficients, modelers must take care to account for essential predictor variables. Models that do not account for essential predictor variables can produce distorted pictures of reality due to omitted variable bias and confounding effects.
Model validity - regression assumptions	All models are wrong, but some are useful. Before reading into a model’s estimated coefficients, modelers must take care to check for evidence of overfitting and assess the 5 assumptions of linear regression. One-hot encoding, while necessary, can often produce very sparse binary predictors which carry little information. Predictors with very little variability should be removed prior to model fitting.
Model interpretation and hypothesis testing	All models are wrong, but some are useful.
Feature selection with PCA
Unpacking PCA	PCA transforms your original data by projecting it onto new axes. Principal components are orthogonal vectors through your data that explain the most variability.
Regularization methods - lasso, ridge, and elastic net
Exploring additional datasets
Clustering high dimensional data	TODO
Data visualization techniques: t-SNE and PaCMAP	TODO
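The "Unpacking PCA" key point above can be made concrete with a small numerical sketch. The example below is not taken from the lesson itself: it assumes NumPy is available, uses made-up toy data, and computes the principal axes with a singular value decomposition of the centered data rather than any particular library's PCA class. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: 200 observations, 3 features.
# The first two features share a latent signal; the third is independent noise.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center each feature, then decompose. The rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt  # shape (3, 3), one principal axis per row
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the centered data onto the new axes.
scores = X_centered @ components.T

# The axes are orthonormal, so this product is (numerically) the identity matrix.
gram = components @ components.T
```

Inspecting `gram` shows the orthogonality of the components, and `explained_ratio` lists, in decreasing order, the share of total variance captured along each new axis.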

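The key point about sparse one-hot predictors can also be sketched in a few lines. This is an illustration under stated assumptions, not the lesson's own procedure: the category values, the counts, and the 5% cutoff (`min_fraction`) are all hypothetical, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical categorical predictor with one very rare level.
categories = np.array(["Gable"] * 60 + ["Hip"] * 38 + ["Flat"] * 2)

# One-hot encode: one binary column per level (levels are sorted alphabetically).
levels = np.unique(categories)
one_hot = (categories[:, None] == levels[None, :]).astype(int)

# Fraction of ones in each binary column.
frequency = one_hot.mean(axis=0)

# A binary column has little variability when either value is rare.
# The 5% cutoff is an assumption chosen for illustration only.
min_fraction = 0.05
minority_share = np.minimum(frequency, 1 - frequency)
keep = minority_share >= min_fraction

kept_levels = levels[keep]
filtered = one_hot[:, keep]
```

Here the rare level is dropped because its column is almost entirely zeros. A suitable cutoff depends on the dataset size and the modeling goal.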
Glossary

FIXME