Introduction
- Machine learning is a set of tools and techniques that use data to make predictions.
- Artificial intelligence is a broader term that refers to making computers show human-like intelligence.
- Deep learning is a subset of machine learning.
- All machine learning systems have limitations to be aware of.
Supervised methods - Regression
- A supervised learning pipeline includes data loading, cleaning,
feature selection, training, and testing.
- Scikit-Learn provides simple, consistent tools for regression, model
fitting, and performance evaluation.
- Always split data into train and test sets to avoid overfitting and
to assess model generalization.
- Dummy coding (one-hot encoding) converts categorical variables into
a numeric form usable by ML models.
- Polynomial regression can capture non-linear trends by expanding
features into polynomial terms.
- Early exploratory data analysis (EDA) helps reveal relationships,
clusters, and potential predictors before modeling.
- Overfitting occurs when a model learns noise instead of signal; simpler models and proper train/test splits help mitigate this.
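The regression workflow above can be sketched end to end. This is a minimal illustration using synthetic data (the quadratic relationship and all parameter values are assumptions for the example, not from the lesson):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y = 0.5 * x^2 plus noise (illustrative only)
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.2, size=200)

# Always split into train and test sets before fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Polynomial regression: expand features into polynomial terms,
# then fit an ordinary linear model on the expanded features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x_train, y_train)

# Evaluate on the held-out test set to assess generalization
print(f"test R^2: {model.score(x_test, y_test):.3f}")
```

A plain `LinearRegression` fit on the raw feature would score poorly here; the polynomial expansion is what lets the linear model capture the curved trend.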
Supervised methods - Classification
- Classification is a supervised learning task where the goal is to predict discrete class labels from labeled examples.
- Train/test splits let us estimate how well a classifier will generalize to unseen data; for classification, stratifying by class is often important.
- Decision trees are easy to train and interpret, but can overfit when depth and other hyperparameters are not controlled.
- Hyperparameters (such as max_depth) control model complexity and behavior but are not learned directly from the data.
- Models that rely on distances or geometric margins in feature space (such as SVMs) usually require standardized inputs; tree-based models typically do not.
- Comparing different classifiers (for example, decision trees vs SVMs) on the same train/test split helps reveal tradeoffs between accuracy, robustness, and interpretability.
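The points above can be demonstrated by training both classifier types on the same stratified split. This sketch uses the built-in iris dataset as a stand-in; the max_depth value is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratify so each class appears in the same proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# max_depth is a hyperparameter: it caps tree complexity to curb overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# SVMs work on geometric margins, so standardize features first;
# the tree above needs no scaling
svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

print(f"tree accuracy: {tree.score(X_test, y_test):.3f}")
print(f"svm accuracy:  {svm.score(X_test, y_test):.3f}")
```

Comparing the two scores on the same split is the fair way to weigh accuracy against interpretability.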
Ensemble methods
- Ensemble methods combine predictions from multiple models to produce more stable and accurate results than most single models.
- Bagging (such as random forests) trains the same model on different bootstrap samples and averages their predictions, usually reducing variance and overfitting.
- Boosting trains models in sequence, focusing later models on the mistakes of earlier ones, often improving accuracy at the cost of increased complexity and computation.
- Stacking uses a meta-model to combine the outputs of several diverse base models trained on the same data.
- Random forests often outperform single decision trees by averaging many shallow, noisy trees into a more robust classifier.
- VotingRegressor and VotingClassifier provide a simple way to combine multiple estimators in Scikit-Learn (by averaging or majority vote) for regression or classification tasks.
- Choosing an ensemble method and tuning its hyperparameters is closely tied to the bias–variance tradeoff and the characteristics of the dataset.
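A bagging ensemble (random forest) and a voting ensemble can be compared in a few lines. This sketch uses the built-in breast cancer dataset and arbitrary estimator counts, chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Bagging: a random forest averages many trees fit on bootstrap samples
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# VotingClassifier combines diverse base estimators by majority vote
voting = VotingClassifier([
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("logreg", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
voting.fit(X_train, y_train)

print(f"forest accuracy: {forest.score(X_test, y_test):.3f}")
print(f"voting accuracy: {voting.score(X_test, y_test):.3f}")
```

For true stacking with a fitted meta-model, Scikit-Learn also offers StackingClassifier and StackingRegressor.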
Unsupervised methods - Clustering
- Clustering is an unsupervised learning method that groups similar
data points without labeled outputs.
- K-means is simple, fast, and well-suited for large datasets, but it
assumes spherical clusters and linear boundaries.
- The number of clusters (k) must be specified in advance, and
silhouette scores can help evaluate and compare clustering
quality.
- K-means can perform poorly when clusters overlap or have complex
shapes, such as concentric circles.
- Spectral clustering overcomes many of k-means’ geometric limitations
using graph-based and kernel methods, but is slower and less
scalable.
- Scikit-Learn provides tools for generating datasets (make_blobs,
make_circles) and implementing both k-means and spectral
clustering.
- Choosing the right clustering method often depends on the dataset’s structure, size, and available computational resources.
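The concentric-circles failure mode is easy to reproduce with the generators mentioned above. A minimal sketch (the noise level and neighbour-graph affinity are assumed values for illustration):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Concentric circles: a shape k-means cannot separate with linear boundaries
X, y_true = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# k must be specified in advance for both methods
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

# Compare recovered labels against the true ring membership
print(f"k-means ARI:  {adjusted_rand_score(y_true, km_labels):.2f}")
print(f"spectral ARI: {adjusted_rand_score(y_true, sc_labels):.2f}")

# Silhouette scores can also be used to compare clusterings without labels
print(f"k-means silhouette: {silhouette_score(X, km_labels):.2f}")
```

Spectral clustering recovers the rings almost perfectly here, while k-means splits the data down the middle, illustrating why the method must match the data's structure.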
Unsupervised methods - Dimensionality reduction
- PCA is a linear dimensionality reduction technique for tabular data.
- t-SNE is a non-linear dimensionality reduction technique for tabular data that can reveal structure, such as clusters, that PCA's linear projection misses.
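Both techniques reduce a high-dimensional table to two columns suitable for plotting. A sketch using the built-in digits dataset (chosen here as an illustrative example):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 features each

# PCA: linear projection onto the directions of greatest variance
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear embedding that preserves local neighbourhoods
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # (1797, 2) (1797, 2)
```

Plotting either 2-column result with a scatter plot, coloured by digit label, shows t-SNE separating the ten digit classes more cleanly than PCA.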
Neural Networks
- Perceptrons are artificial neurons, the building blocks of neural networks.
- A perceptron takes multiple inputs, multiplies each by a weight value and sums the weighted inputs. It then applies an activation function to the sum.
- A single perceptron can only learn simple functions that are linearly separable.
- Multiple perceptrons can be combined to form a neural network which can solve functions that aren’t linearly separable.
- We can train a whole neural network with the backpropagation algorithm; Scikit-Learn includes an implementation of this algorithm.
- Training a neural network requires training data that shows the network examples of what to learn.
- To validate our training, we split the available data into a training set and a test set.
- To make use of the whole dataset for both training and testing, we can train multiple times with different subsets of the data acting as training/testing data. This is called cross-validation.
- Deep learning neural networks are a very powerful modern machine learning technique. Scikit-Learn does not support these, but other libraries such as TensorFlow do.
- Several companies now offer cloud APIs where we can train neural networks on powerful computers.
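The training and cross-validation steps above can be sketched with Scikit-Learn's MLPClassifier. The dataset, layer size, and fold count are illustrative choices, not prescribed by the lesson:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# A small multi-layer perceptron, trained by backpropagation
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
)

# 5-fold cross-validation: every sample serves in both training and testing
# across the folds, and we get five independent accuracy estimates
scores = cross_val_score(mlp, X, y, cv=5)
print(f"fold accuracies: {scores.round(3)}")
print(f"mean accuracy:   {scores.mean():.3f}")
```

Averaging across folds gives a more reliable estimate of generalization than a single train/test split.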
Ethics and the Implications of Machine Learning
- The results of machine learning reflect biases in the training and input data.
- Many machine learning algorithms can’t explain how they arrived at a decision.
- Machine learning can be used for unethical purposes.
- Consider the implications of false positives and false negatives.
Find out more
- This course has only touched on a few areas of machine learning and is designed to teach you just enough to do something useful.
- Machine learning is a rapidly evolving field and new tools and techniques are constantly appearing.