Introduction


  • Machine learning is a set of tools and techniques that use data to make predictions.
  • Artificial intelligence is a broader term that refers to making computers show human-like intelligence.
  • Deep learning is a subset of machine learning.
  • All machine learning systems have limitations to be aware of.

Supervised methods - Regression


  • A supervised learning pipeline includes data loading, cleaning, feature selection, training, and testing.
  • Scikit-Learn provides simple, consistent tools for regression, model fitting, and performance evaluation.
  • Always split data into train and test sets so that you can detect overfitting and assess how well a model generalizes.
  • Dummy coding (one-hot encoding) converts categorical variables into a numeric form usable by ML models.
  • Polynomial regression can capture non-linear trends by expanding features into polynomial terms (see the sketch after this list).
  • Early exploratory data analysis (EDA) helps reveal relationships, clusters, and potential predictors before modeling.
  • Overfitting occurs when a model learns noise instead of signal; simpler models and proper train/test splits help mitigate this.
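
As a concrete illustration of the points above, here is a minimal sketch of polynomial regression with a train/test split. The synthetic quadratic dataset, the polynomial degree, and the random seeds are illustrative assumptions, not part of the lesson.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic data with a quadratic trend plus noise (an assumption for demonstration).
    rng = np.random.default_rng(42)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

    # Hold out a test set so we can detect overfitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Expand the feature into polynomial terms, then fit an ordinary linear model.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X_train, y_train)

    # Evaluate on the held-out data to assess generalization.
    print("test R^2:", r2_score(y_test, model.predict(X_test)))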

Supervised methods - Classification


  • Classification is a supervised learning task where the goal is to predict discrete class labels from labeled examples.
  • Train/test splits let us estimate how well a classifier will generalize to unseen data; for classification, stratifying by class is often important.
  • Decision trees are easy to train and interpret, but can overfit when depth and other hyperparameters are not controlled.
  • Hyperparameters (such as max_depth) control model complexity and behavior but are not learned directly from the data.
  • Models that rely on distances or geometric margins in feature space (such as SVMs) usually require standardized inputs; tree-based models typically do not.
  • Comparing different classifiers (for example, decision trees vs SVMs) on the same train/test split helps reveal tradeoffs between accuracy, robustness, and interpretability; a minimal comparison sketch follows this list.
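
A minimal sketch of that comparison, assuming the iris dataset and illustrative hyperparameter values; neither is prescribed by the lesson.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # stratify=y keeps the class balance the same in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    # max_depth is a hyperparameter: it limits tree complexity to curb overfitting.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train, y_train)

    # SVMs rely on distances in feature space, so standardize the inputs first.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    svm.fit(X_train, y_train)

    print("tree accuracy:", tree.score(X_test, y_test))
    print("svm accuracy: ", svm.score(X_test, y_test))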

Ensemble methods


  • Ensemble methods combine predictions from multiple models to produce more stable and accurate results than most single models.
  • Bagging (such as random forests) trains the same model on different bootstrap samples and averages their predictions, usually reducing variance and overfitting.
  • Boosting trains models in sequence, focusing later models on the mistakes of earlier ones, often improving accuracy at the cost of increased complexity and computation.
  • Stacking uses a meta-model to combine the outputs of several diverse base models trained on the same data.
  • Random forests often outperform single decision trees by averaging many noisy, decorrelated trees into a more robust classifier.
  • VotingRegressor and VotingClassifier provide a simple way to combine multiple estimators in Scikit-Learn by averaging their predictions or taking a majority vote (see the sketch after this list).
  • Choosing an ensemble method and tuning its hyperparameters is closely tied to the bias–variance tradeoff and the characteristics of the dataset.
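
Below is a minimal sketch of two of these ideas in Scikit-Learn: bagging via RandomForestClassifier, and combining diverse models with VotingClassifier. The dataset and hyperparameter values are illustrative assumptions.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    # Bagging: many trees trained on bootstrap samples, their votes averaged.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)

    # Voting: combine different model families by majority vote.
    voter = VotingClassifier(estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ])
    voter.fit(X_train, y_train)

    print("forest accuracy:", forest.score(X_test, y_test))
    print("voting accuracy:", voter.score(X_test, y_test))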

Unsupervised methods - Clustering


  • Clustering is a form of unsupervised learning.
  • Unsupervised learning algorithms don’t need labelled training data; they find structure in the data by themselves.
  • k-means is a popular clustering algorithm.
  • k-means is less useful when one cluster exists within another, such as concentric circles.
  • Spectral clustering can overcome some of the limitations of k-means, as the sketch after this list shows.
  • Spectral clustering is much slower than k-means.
  • Scikit-Learn has functions to create example data.
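
A minimal sketch contrasting the two algorithms on concentric circles, generated with one of Scikit-Learn's example-data helpers (make_circles); the parameter values are illustrative assumptions.

    from sklearn.cluster import KMeans, SpectralClustering
    from sklearn.datasets import make_circles

    # Two concentric rings: one cluster inside another.
    X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

    # k-means assumes roughly spherical clusters, so it splits the rings badly.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Spectral clustering works on a nearest-neighbour graph and can recover
    # the rings, at a noticeably higher computational cost.
    sc_labels = SpectralClustering(
        n_clusters=2, affinity="nearest_neighbors", random_state=0).fit_predict(X)

    print("k-means labels:  ", km_labels[:10])
    print("spectral labels: ", sc_labels[:10])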

Unsupervised methods - Dimensionality reduction


  • PCA is a linear dimensionality reduction technique for tabular data.
  • t-SNE is a non-linear dimensionality reduction technique for tabular data that can capture structure a linear method like PCA misses (see the sketch after this list).
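
A minimal sketch applying both techniques to the 64-dimensional digits dataset; the component counts and random seed are illustrative assumptions.

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)

    # PCA projects onto the directions of greatest variance (linear).
    X_pca = PCA(n_components=2).fit_transform(X)

    # t-SNE preserves local neighbourhoods and can separate curved structure.
    X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

    print("PCA shape:  ", X_pca.shape)   # (1797, 2)
    print("t-SNE shape:", X_tsne.shape)  # (1797, 2)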

Neural Networks


  • Perceptrons are artificial neurons, the building blocks of neural networks.
  • A perceptron takes multiple inputs, multiplies each by a weight value and sums the weighted inputs. It then applies an activation function to the sum.
  • A single perceptron can only learn simple functions that are linearly separable.
  • Multiple perceptrons can be combined to form a neural network which can solve functions that aren’t linearly separable.
  • We can train a whole neural network with the backpropagation algorithm. Scikit-Learn includes an implementation of this algorithm (a minimal sketch follows this list).
  • Training a neural network requires some training data to show the network examples of what to learn.
  • To validate our training we split the data into a training set and a test set.
  • To ensure the whole dataset can be used in training and testing, we can train multiple times with different subsets of the data acting as the training/testing data. This is called cross-validation.
  • Deep learning neural networks are a very powerful modern machine learning technique. Scikit-Learn does not support these, but other libraries like TensorFlow do.
  • Several companies now offer cloud APIs where we can train neural networks on powerful computers.
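
A minimal sketch of training Scikit-Learn's backpropagation-based MLPClassifier and validating it with 5-fold cross-validation; the layer size, iteration limit, and dataset are illustrative assumptions.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)

    # Scaling the inputs generally helps gradient-based training converge.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
    )

    # Cross-validation: across the 5 folds, every sample is used for both
    # training and testing.
    scores = cross_val_score(model, X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean accuracy:  ", scores.mean())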

Ethics and the Implications of Machine Learning


  • The results of machine learning reflect biases in the training and input data.
  • Many machine learning algorithms can’t explain how they arrived at a decision.
  • Machine learning can be used for unethical purposes.
  • Consider the implications of false positives and false negatives.

Find out more


  • This course has only touched on a few areas of machine learning and is designed to teach you just enough to do something useful.
  • Machine learning is a rapidly evolving field and new tools and techniques are constantly appearing.