Lesson 6: Unsupervised Learning and Dimensionality Reduction
Introduction
Welcome to Lesson 6! Today, we're diving into the fascinating world of unsupervised learning and dimensionality reduction. Unlike supervised learning where we have labeled data, unsupervised learning tries to find patterns in data without any pre-existing labels. It's like being a detective, looking for clues and connections in a mystery where you don't know the solution beforehand.
We'll cover three main topics: Clustering (specifically K-means), Principal Component Analysis (PCA), and Autoencoders. Don't worry if these terms sound complex - we'll break them down with simple analogies and hands-on examples.
1. Clustering with K-means
Imagine you're organizing a large box of mixed Lego pieces. You might naturally start grouping similar pieces together - all the red bricks in one pile, all the blue plates in another, and so on. This is essentially what clustering does with data.
K-means is one of the simplest and most popular clustering algorithms. Here's how it works:
1. You decide how many groups (clusters) you want to create.
2. The algorithm places a 'centroid' (like a flag) for each cluster at a random starting position.
3. Each data point is assigned to the nearest centroid.
4. Each centroid is moved to the average position of all points assigned to it.
5. Steps 3 and 4 are repeated until the centroids stop moving significantly.
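To make these steps concrete, here's a minimal from-scratch sketch of the K-means loop in plain NumPy. It assumes X is a NumPy array of shape (n_samples, n_features); for brevity it skips edge cases such as a cluster losing all of its points:
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: start with k randomly chosen data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
In practice you'd rely on scikit-learn's KMeans (used below), which adds smarter 'k-means++' initialization and handles those edge cases for you.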
Here's a simple implementation of K-means clustering using scikit-learn:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate sample data
np.random.seed(42)
X = np.random.rand(300, 2) * 10
# Perform K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
marker='x', s=200, linewidths=3, color='r')
plt.title('K-means Clustering')
plt.show()
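Once the model is fitted, the same KMeans object can label new points and report how tight the clusters are. The snippet below assumes the kmeans object from the code above is still in scope; predict() and the inertia_ attribute are both part of scikit-learn's KMeans API:
# Assign new, unseen points to the learned clusters
new_points = np.array([[1.0, 2.0], [8.5, 7.5]])
print(kmeans.predict(new_points))  # cluster index for each new point

# Inertia: sum of squared distances from each point to its closest centroid
# (lower means tighter clusters; handy when comparing different n_clusters values)
print(kmeans.inertia_)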
2. Principal Component Analysis (PCA)
Imagine you're trying to take a photo of a 3D object for a 2D postcard. You'd want to find the angle that captures the most interesting features of the object. PCA does something similar with high-dimensional data - it finds the 'angles' (principal components) that capture the most information.
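Under the hood, the principal components are the directions of greatest variance in the data, which turn out to be the eigenvectors of the data's covariance matrix. Here's a minimal NumPy sketch of that idea for illustration; it assumes X has shape (n_samples, n_features) and keeps the top two components:
import numpy as np

def pca_sketch(X, n_components=2):
    # Center the data so variance is measured around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # Eigenvectors of the covariance matrix are the principal components
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order, so reverse to put the largest first
    components = eigvecs[:, ::-1][:, :n_components]
    # Project the centered data onto the top components
    return X_centered @ components
Real implementations, including scikit-learn's PCA used below, compute this more efficiently via the singular value decomposition, but the result is the same up to the sign of each component.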
PCA is useful for:
- Reducing the number of features in your data while preserving most of the information
- Visualizing high-dimensional data in 2D or 3D
- Identifying the most important features in your dataset
Here's how to perform PCA using scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 10)
# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot the results
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA: Data in 2D')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
# Print explained variance ratio
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
3. Autoencoders
Imagine you're an artist trying to capture the essence of a complex scene in a quick sketch. You focus on the most important details, leaving out the minor ones. When you later want to recreate the full scene, you use your sketch as a guide, filling in the details from memory. This is similar to how autoencoders work.
An autoencoder is a type of neural network that learns to:
- Compress (encode) input data into a lower-dimensional representation
- Reconstruct (decode) the input data from this representation
Autoencoders are useful for:
- Dimensionality reduction
- Feature learning
- Denoising data
- Generating new data similar to the training data
Here's a simple autoencoder implemented in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim
# Define the autoencoder architecture
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        # Encoder: compress a flattened 784-pixel MNIST image down to 32 values
        self.encoder = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32)
        )
        # Decoder: reconstruct the 784-pixel image from the 32-value code
        self.decoder = nn.Sequential(
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 784),
            nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
# Create the model, loss function, and optimizer
model = Autoencoder()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())
# Assume we have a DataLoader called 'train_loader' with MNIST data
# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for data in train_loader:
        img, _ = data
        # Flatten each 28x28 image into a 784-dimensional vector
        img = img.view(img.size(0), -1)

        # Forward pass: reconstruct the input and measure reconstruction error
        output = model(img)
        loss = criterion(output, img)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
# Now you can use model.encoder to get the compressed representation
# and model.decoder to reconstruct the image from the compressed representation
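For example, after training you can pull out the 32-dimensional codes for a batch of images by calling the encoder directly. This assumes the train_loader mentioned above; torch.no_grad() just disables gradient tracking during inference:
with torch.no_grad():
    sample_img, _ = next(iter(train_loader))
    sample_img = sample_img.view(sample_img.size(0), -1)
    codes = model.encoder(sample_img)        # shape: (batch_size, 32)
    reconstructions = model.decoder(codes)   # shape: (batch_size, 784)
print(codes.shape, reconstructions.shape)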
Challenge: Dimensionality Reduction on MNIST
Now it's your turn! Try to apply dimensionality reduction techniques to the MNIST dataset of handwritten digits. Here are some ideas:
- Use PCA to reduce the MNIST images (28x28 pixels) to 50 dimensions, then reconstruct them
- Build an autoencoder to compress MNIST images to 32 dimensions, then reconstruct them
- Compare the quality of reconstructed images from PCA and the autoencoder
- Use K-means to cluster the MNIST digits (either on raw pixel values or after PCA) and visualize the cluster centers
- Try to classify MNIST digits using the reduced representations from PCA or the autoencoder
This challenge will help you apply what you've learned to a real-world dataset and compare different dimensionality reduction techniques.
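To get you started on the first idea, here's a starter sketch using scikit-learn. It loads MNIST via fetch_openml (the dataset name 'mnist_784' is the standard OpenML identifier; feel free to substitute any MNIST loader you prefer), reduces the images to 50 dimensions, and reconstructs them:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
import numpy as np

# Load MNIST: 70,000 images of 28x28 = 784 pixels each
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]

# Reduce to 50 dimensions, then reconstruct
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)

print(f"Variance kept by 50 components: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Mean squared reconstruction error: {np.mean((X - X_reconstructed) ** 2):.5f}")
From here you can reshape rows of X_reconstructed back to 28x28 and plot them next to the originals, then repeat the comparison with your autoencoder's reconstructions.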
Practical Project: Image Compression with Autoencoders
Let's put our knowledge into practice by creating a simple image compression system using autoencoders. We'll use the CIFAR-10 dataset, which contains 32x32 color images.
Here's a step-by-step guide:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
# Define the autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        # Encoder: 3x32x32 image -> 64x4x4 feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1),   # -> 16x16x16
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # -> 32x8x8
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # -> 64x4x4
        )
        # Decoder: 64x4x4 feature map -> 3x32x32 image
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),  # -> 32x8x8
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # -> 16x16x16
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1),   # -> 3x32x32
            nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
# Load and preprocess the CIFAR-10 dataset
transform = transforms.Compose([transforms.ToTensor()])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=128, shuffle=True)
# Initialize the model, loss function, and optimizer
model = Autoencoder()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())
# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for data in trainloader:
        img, _ = data
        output = model(img)
        loss = criterion(output, img)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
# Visualize results
def imshow(img):
    # Rearrange from (C, H, W) tensor layout to (H, W, C) for matplotlib
    img = img.squeeze().permute(1, 2, 0)
    plt.imshow(img)
# Get some random images
dataiter = iter(trainloader)
images, _ = next(dataiter)
# Reconstruct images
outputs = model(images)
# Plot original and reconstructed images
plt.figure(figsize=(12, 6))
for i in range(10):
    # Top row: original images
    plt.subplot(2, 10, i+1)
    imshow(images[i])
    plt.axis('off')
    # Bottom row: reconstructions
    plt.subplot(2, 10, i+11)
    imshow(outputs[i].detach())
    plt.axis('off')
plt.show()
This code creates an autoencoder that compresses each 32x32x3 image (3,072 values) down to a 64x4x4 feature map (1,024 values) and then reconstructs it. That works out to a 3:1 compression ratio, which is significant for image data.
After running this code, you'll see a visualization of original and reconstructed images. Compare them to see how well the autoencoder preserved the important features of the images.
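Beyond eyeballing the images, you can quantify reconstruction quality with the same mean squared error used during training. The snippet below assumes the images and outputs tensors from the visualization code above are still in scope:
# Per-image mean squared error between originals and reconstructions
with torch.no_grad():
    per_image_mse = ((outputs - images) ** 2).mean(dim=(1, 2, 3))
print(f"Average reconstruction MSE: {per_image_mse.mean().item():.4f}")
print(f"Worst reconstruction in the batch: {per_image_mse.max().item():.4f}")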
Key Takeaways
- Unsupervised learning finds patterns in data without pre-existing labels.
- K-means clustering groups similar data points together.
- PCA reduces data dimensionality while preserving important information.
- Autoencoders can compress and reconstruct data, useful for dimensionality reduction and feature learning.
- These techniques have wide applications in data compression, feature extraction, and anomaly detection.