RSM412H1 Lecture Notes - Lecture 10: Unsupervised Learning, Taxicab Geometry, Hierarchical Clustering
Document Summary
March 19, 2020

- Unsupervised learning: there is no target variable, so you don't know in advance what you are looking for and cannot validate with MSE, SSE, etc. The goal is to explain the data and develop insights; it can also be used for data reduction.
- PCA: a main application is image recognition.
- Clustering divides the data into groups so that data points in the same group are similar to one another and dissimilar to data points in other groups. Clusters are defined to minimize within-cluster variation and maximize between-cluster variation.
- Hartigan-Wong algorithm (k-means): within-cluster variation is the sum of squared Euclidean distances between each data point belonging to a cluster and its corresponding centroid; the centroid is the mean value of the points assigned to the cluster. Each iteration calculates distances and assigns each observation to its closest centroid.
- Choosing k: the optimal number of clusters is where the bend ("elbow") occurs in a plot of k vs. WSS.
- Dissimilarity matrix: the matrix of distances between each pair of observations. It can give false confidence in the compactness of clusters. Manhattan (taxicab) distance is also a good choice if features deviate significantly from normality.
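The notes name the Hartigan-Wong algorithm; as a minimal sketch, the simpler Lloyd's variant of k-means below illustrates the same assign-then-recompute loop and the within-cluster sum of squares (WSS) used in the elbow plot. The function name, toy data, and seeds are illustrative, not from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's variant, not Hartigan-Wong):
    alternate between assigning each point to its nearest centroid
    and recomputing each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # within-cluster sum of squares: squared distance of each point
    # to the centroid of its assigned cluster
    wss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wss

# toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# elbow-plot data: WSS drops sharply until k reaches the true cluster count
for k in (1, 2, 3, 4):
    print(k, round(kmeans(X, k)[2], 2))
```

Plotting k against the printed WSS values gives the elbow curve: the bend at k = 2 marks the true number of clusters in this toy data.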
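The dissimilarity matrix above can be computed under either Euclidean or Manhattan (taxicab) distance; a minimal sketch with both metrics follows. The function name and toy points are illustrative, not from the lecture.

```python
import numpy as np

def dissimilarity_matrix(X, metric="euclidean"):
    """Pairwise distance matrix between all observations in X.
    Manhattan (taxicab) distance sums absolute coordinate
    differences instead of squaring them, so it is less dominated
    by a single large deviation than Euclidean distance."""
    diff = X[:, None, :] - X[None, :, :]   # all pairwise differences
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    return np.sqrt((diff ** 2).sum(axis=2))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D_euc = dissimilarity_matrix(X)
D_man = dissimilarity_matrix(X, "manhattan")
print(D_euc[0, 1])  # 5.0 (3-4-5 right triangle)
print(D_man[0, 1])  # 7.0 (3 + 4)
```

The matrix is symmetric with zeros on the diagonal; hierarchical clustering methods take exactly this kind of matrix as input.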