Introduction
- Machine learning is a set of tools and techniques that use data to make predictions.
- Artificial intelligence is a broader term that refers to making computers show human-like intelligence.
- Deep learning is a subset of machine learning.
- All machine learning systems have limitations to be aware of.
Supervised methods - Regression
- A supervised learning pipeline includes data loading, cleaning,
feature selection, training, and testing.
- Scikit-Learn provides simple, consistent tools for regression, model
fitting, and performance evaluation.
- Always split data into train and test sets to avoid overfitting and
to assess model generalization.
- Dummy coding (one-hot encoding) converts categorical variables into
a numeric form usable by ML models.
- Polynomial regression can capture non-linear trends by expanding
features into polynomial terms.
- Early exploratory data analysis (EDA) helps reveal relationships,
clusters, and potential predictors before modeling.
- Overfitting occurs when a model learns noise instead of signal; simpler models and sound train/test splits help mitigate this (see the sketch after this list).
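As a minimal sketch of this pipeline, the following fits a polynomial regression on made-up data; the synthetic dataset and the degree-2 expansion are illustrative assumptions, not taken from the lesson:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with a non-linear (quadratic) trend plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# Hold out a test set to assess generalization and guard against overfitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Expand features into polynomial terms, then fit a linear model on them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```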
Supervised methods - Classification
- Classification is a supervised learning task where the goal is to predict discrete class labels from labeled examples.
- Train/test splits let us estimate how well a classifier will generalize to unseen data; for classification, stratifying by class is often important.
- Decision trees are easy to train and interpret, but can overfit when depth and other hyperparameters are not controlled.
- Hyperparameters (such as max_depth) control model complexity and behavior but are not learned directly from the data.
- Models that rely on distances or geometric margins in feature space (such as SVMs) usually require standardized inputs; tree-based models typically do not.
- Comparing different classifiers (for example, decision trees vs SVMs) on the same train/test split helps reveal tradeoffs between accuracy, robustness, and interpretability (see the sketch after this list).
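A sketch of such a comparison, assuming the built-in iris dataset and an arbitrary max_depth of 3 (both illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# stratify=y keeps class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# max_depth is a hyperparameter: we choose it; it is not learned from data.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# SVMs depend on distances in feature space, so standardize inputs first.
svm = make_pipeline(StandardScaler(), SVC())
svm.fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("svm accuracy: ", svm.score(X_test, y_test))
```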
Ensemble methods
- Ensemble methods combine predictions from multiple models to produce more stable and accurate results than most single models.
- Bagging (such as random forests) trains the same model on different bootstrap samples and averages their predictions, usually reducing variance and overfitting.
- Boosting trains models in sequence, focusing later models on the mistakes of earlier ones, often improving accuracy at the cost of increased complexity and computation.
- Stacking uses a meta-model to combine the outputs of several diverse base models trained on the same data.
- Random forests often outperform single decision trees by averaging many shallow, noisy trees into a more robust classifier.
- VotingRegressor and VotingClassifier provide a simple way to combine multiple estimators in Scikit-Learn for regression or classification tasks (see the sketch after this list).
- Choosing an ensemble method and tuning its hyperparameters is closely tied to the bias–variance tradeoff and the characteristics of the dataset.
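A sketch of bagging and voting in Scikit-Learn; the breast-cancer dataset and all hyperparameter values are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Bagging: a random forest averages many trees fit on bootstrap samples.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("forest accuracy:", forest.score(X_test, y_test))

# Voting: combine diverse base models by majority vote.
vote = VotingClassifier([
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression())),
])
vote.fit(X_train, y_train)
print("voting accuracy:", vote.score(X_test, y_test))
```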
Unsupervised methods - Clustering
- Clustering is a form of unsupervised learning.
- Unsupervised learning algorithms don’t need labeled training data; they find structure in the data itself.
- k-means is a popular clustering algorithm.
- k-means is less useful when one cluster exists within another, such as concentric circles.
- Spectral clustering can overcome some of the limitations of k-means.
- Spectral clustering is much slower than k-means.
- Scikit-Learn has functions, such as make_circles, to create example data (used in the sketch after this list).
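A sketch of the concentric-circles case using Scikit-Learn's synthetic data helpers; the sample size, noise level, and affinity choice are illustrative:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

# Two concentric circles: one cluster sits inside the other.
X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# k-means assumes roughly convex clusters, so it splits the rings badly.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering uses nearest-neighbor connectivity and can recover
# the two rings, at a higher computational cost.
sc_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0).fit_predict(X)
```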
Unsupervised methods - Dimensionality reduction
- PCA is a linear dimensionality reduction technique for tabular data.
- t-SNE is a non-linear dimensionality reduction technique for tabular data that can capture structure a linear projection such as PCA misses (see the sketch after this list).
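A sketch applying both techniques to the built-in digits data; reducing to two components and the perplexity value are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64 features per sample

# PCA: a linear projection onto the directions of greatest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: a non-linear embedding that preserves local neighborhoods and can
# separate clusters that overlap in a linear projection. Slower than PCA.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```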
Neural networks
- Perceptrons are artificial neurons, the building blocks of neural networks.
- A perceptron takes multiple inputs, multiplies each by a weight value and sums the weighted inputs. It then applies an activation function to the sum.
- A single perceptron can only solve simple functions that are linearly separable.
- Multiple perceptrons can be combined to form a neural network which can solve functions that aren’t linearly separable.
- We can train a whole neural network with the backpropagation algorithm. Scikit-Learn includes an implementation of this algorithm (see the sketch after this list).
- Training a neural network requires some training data to show the network examples of what to learn.
- To validate our training we split the training data into a training set and a test set.
- To make use of the whole dataset for both training and testing, we can train multiple times with different subsets of the data acting as training/testing data. This is called cross-validation.
- Deep learning neural networks are a very powerful modern machine learning technique. Scikit-Learn does not support them, but other libraries such as TensorFlow do.
- Several companies now offer cloud APIs where we can train neural networks on powerful computers.
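A sketch of Scikit-Learn's multi-layer perceptron combined with cross-validation; the hidden layer size, iteration limit, and fold count are illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# MLPClassifier trains a network of perceptrons with backpropagation.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
)

# 5-fold cross-validation: each subset of the data takes a turn as the
# test set, so the whole dataset is used for both training and testing.
scores = cross_val_score(mlp, X, y, cv=5)
print("fold accuracies:", scores)
```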
Ethics and the implications of machine learning
- The results of machine learning reflect biases in the training and input data.
- Many machine learning algorithms can’t explain how they arrived at a decision.
- Machine learning can be used for unethical purposes.
- Consider the implications of false positives and false negatives.
Find out more
- This course has only touched on a few areas of machine learning and is designed to teach you just enough to do something useful.
- Machine learning is a rapidly evolving field and new tools and techniques are constantly appearing.