{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import division, print_function, absolute_import" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Unsupervised machine learning problems\n", "\n", "By Matthew J. Graham (California Institute of Technology)\n", "\n", "Acknowledgements to A. A. Miller and G. Cabrera.\n", "\n", "(c) 2017, California Institute of Technology\n", "\n", "Version 0.1\n", "\n", "Given a data set of point sources, a typical machine learning problem is how many different kinds of objects can we distinguish in it? This is best answered with unsupervised algorithms, which aim to identify hidden structures in data sets. Of course, interpretation of any such structures, i.e., whether they actually indicate physically meaningful correlations or relationships, is left to the domain expert. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplot.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Revision: scikit-learn\n", "\n", "You're going to use the Python package `scikit-learn` to attack a number of problems which are best handled with unsupervised machine learning algorithms (models). In scikit-learn, different models are accessed via the various modules with user-defined tuning parameters. Data (features) for the models are stored in a 2D numpy array, `X`, with each row representing an individual object and each column a specific feature. Unsupervised models are invoked by calling `.fit(X)`. Once a model has fit a data set, it can be used to generate observations for a new data set `Xnew` by calling `.predict(Xnew)`. Any other method calls are described in the documentation.\n", "\n", "## $k$-means clustering\n", "\n", "You were introduced to the $k$-means clustering algorithm in the first DSFP session in August 2016 (and in the lecture after lunch today) and its application to the gold standard Iris dataset. We're going to revisit a bit of that material first but moving straight away to an astronomical data set from SDSS: SDSS_colors.csv. \n", "\n", "Problem: So let's start with loading the data set and plotting the axes against each other to get a feel for what the data looks like (you should always do this as a matter of course when getting a new (or not so new) data set). " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"SDSS_colors.csv\")\n", "# complete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It certainly appears as though there is structure within this data set but the human brain is very good at finding patterns where there are none (see pareidolia). \n", "\n", "Problem: Let's see what it looks like when you've fit it with two different $k$-means models, one with 2 clusters and one with 3 clusters. Plot the results in the $g-r$ vs. $u-g$ plane (and don't forget to also plot the cluster centers).\n", "\n", "Hint - Remember the `.labels_` attribute of the `KMeans` object will return the clusters measured by the algorithm." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "Kcluster = KMeans( # complete\n", "# complete\n", " \n", "plt.figure()\n", "plt.scatter( # complete\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')\n", " \n", "Kcluster = KMeans( # complete\n", "\n", "plt.figure()\n", "plt.scatter( # complete\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fine tuning the algorithm\n", "\n", "The `scikit-learn` implementation of $k$-means has various optional parameters that can be used to fine tune the algorithm. \n", "\n", "Problem: See how the results change if the $k=3$ cluster model is called with `n_init = 1` and `init = random` options instead. Use `rs` for the random state.\n", "\n", "Note: the respective defaults for these parameters are 10 and k-means++, respectively. Refer to the documentation to see why these are better choices than what you are just about to try." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rs = 14\n", "Kcluster1 = KMeans(# complete\n", "\n", "plt.figure()\n", "plt.scatter(# complete\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rescaling the data\n", "\n", "By default, $k$-means uses Euclidean distance as its similarity metric and the magnitude of individual features can therefore have a strong effect on the final clustering outcome. \n", "\n", "Problem: Determine the mean, standard deviation, and minimum and maximum values of each feature in the data set. Do you think any one feature is particularly dominant?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(\"Feature\\t\\t\\tmean\\tstd\\tmin\\tmax\")\n", "\n", "for key in df.keys():\n", " print(\"{:s}\\t{:.2f}\\t{:.2f}\\t{:.2f}\\t{:.2f}\".format( # complete\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the ranges of the features are vastly different, it is common practice to rescale each feature, either to the range [0, 1] (known as normalization) or to have zero mean and unit variance (known as standardization - this assumes a Gaussian distribution for the feature). \n", "\n", "Problem: Try rescaling the data to see if it makes a difference to previous results.\n", "\n", "Hint - The '`StandardScaler()`' class within the `sklearn.preprocessing` module may prove useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler().fit_transform( # complete\n", " \n", "Kcluster = KMeans( # complete\n", " \n", "plt.figure()\n", "plt.scatter(# complete\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Different implementations\n", "\n", "`scikit-learn` also has an alternate implementation of the $k$-means algorithm, `MiniBatchKMeans` which is faster but at the expense of accuracy. \n", "\n", "Problem: See what the performance differences between the two implementations are and also whether different results are obtained.\n", "\n", "Hint - The iPython `%timeit` command is very good for performance evaluations." 
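] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of the timing comparison, again on synthetic `make_blobs` data rather than the SDSS colors: `%timeit` times the `.fit()` call of each implementation." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Illustrative timing sketch on synthetic data, not the SDSS colors\n", "from sklearn.cluster import KMeans, MiniBatchKMeans\n", "from sklearn.datasets import make_blobs\n", "\n", "X_demo, _ = make_blobs(n_samples=10000, centers=3, random_state=0)\n", "\n", "# %timeit reports the run time of the statement that follows it\n", "%timeit KMeans(n_clusters=3).fit(X_demo)\n", "%timeit MiniBatchKMeans(n_clusters=3).fit(X_demo)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The cell below repeats the comparison on the SDSS data and also checks whether the two implementations return different clusters."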
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cluster import MiniBatchKMeans\n", "\n", "%timeit MiniBatchKMeans( # complete\n", " \n", "%timeit KMeans( # complete\n", " \n", "# complete\n", " \n", "plt.figure()\n", "plt.scatter(# complete\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many clusters?\n", "\n", "One of the issues with the $k$-means algorithm is that the number of clusters to use, $k$, is a parameter. There are a number of ways of determining this for a given data set. One of them is to calculate the mean silhouette for (a subset of) the data for a variety of $k$ values and pick the value closest to 1. \n", "\n", "Problem: Calculate the mean silhouette value for $k$ in the range 2 to 10 and select the most appropriate one.\n", "\n", "Hint - The `silhouette_score()` class within the `metrics` module may prove useful.\n", "\n", "Note: This will take some time to run" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import metrics\n", "\n", "# Select a random subset of the data\n", "N_small = 100\n", "dfSample = df[ # complete\n", " \n", "ss = []\n", "nn = [2,3,5,10]\n", " \n", "for n in # complete\n", " Kcluster = Kmeans( # complete\n", " \n", " ss.append(metrics.silhouette_score( # complete\n", " \n", "plt.figure()\n", "plt.scatter(# complete\n", "plt.xlabel('k')\n", "plt.ylabel('Mean silhouette')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DBSCAN\n", "\n", "DBSCAN is a highly popular clustering algorithm that does not require the number of clusters to be specified. Instead, it defines clusters according to two parameters: `minPts`, the minimum number of points necessary for a cluster, and $\\epsilon$, a distance measure. `scikit-learn` has an implementation of the algorithm as part of the `cluster` module: `DBSCAN`. $\\epsilon$ and `minPts` are set by the implementation parameters `eps` and `min_samples`, respectively. \n", "\n", "Problem: Cluster the SDSS color data with DBSCAN and see what effect changing the fine tuning parameters has on the results. How does DBSCAN compare to $k$-means? \n", "\n", "Note: DBSCAN labels outliers as -1 and so `plt.scatter()` will plot all these points as the same color." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cluster import DBSCAN\n", "\n", "dbs = DBSCAN( # complete\n", "dbs.fit( # complete\n", " \n", "# Class numbers\n", "labels = dbs.labels_\n", "unique_labels = np.unique(labels)\n", "count = [(labels == lab).sum() for lab in unique_labels]\n", " \n", "plt.figure()\n", "for i in range(len(unique_labels)):\n", " plt.plot( # complete\n", "\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fine tuning the algorithm\n", "\n", "A recommended approach for fine tuning `minPts` and $\\epsilon$ is to set `minPts` to (at least) the dimension of the feature set and plot a $k$-distance graph for the data with $k$ = `minPts`. The appropriate value for $\\epsilon$ is indicated by an elbow in the diagram. \n", "\n", "Problem: Try this for the data set." 
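] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of a $k$-distance graph, again on synthetic `make_blobs` data rather than the SDSS colors: sort every point's distance to its $k$-th nearest neighbor and look for an elbow in the curve." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Illustrative k-distance graph on synthetic data, not the SDSS colors\n", "from sklearn.neighbors import NearestNeighbors\n", "from sklearn.datasets import make_blobs\n", "\n", "X_demo, _ = make_blobs(n_samples=1000, centers=3, random_state=0)\n", "k = X_demo.shape[1]   # rule of thumb: k = dimension of the feature set\n", "\n", "# n_neighbors = k + 1 because each point is its own nearest neighbor\n", "nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_demo)\n", "dist, ind = nbrs.kneighbors(X_demo)\n", "\n", "kth_dist = np.sort(dist[:, -1])   # distance to the k-th neighbor, sorted over all points\n", "\n", "plt.figure()\n", "plt.plot(kth_dist)\n", "plt.xlabel('Points sorted by k-distance')\n", "plt.ylabel('Distance to k-th nearest neighbor (k = %d)' % k)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Apply the same idea to the SDSS colors in the cell below; the elbow suggests a value for `eps`."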
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import NearestNeighbors\n", "\n", "N_neigh = # complete\n", "radius = # complete\n", "neigh = NearestNeighbors( # complete\n", "# complete\n", " \n", "dist, = neigh.kneighbors( # complete\n", " \n", "plt.figure()\n", "plt.scatter( # complete\n", "plt.xlabel('Points sorted by distance to %sth nearest neighbor' % N_neigh)\n", "plt.ylabel('%sth nearest neighbor distance' % N_neigh)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gaussian mixture models\n", "\n", "Gaussian mixture models (GMMs) are a popular choice for clustering data as well as density estimation.\n", "\n", "Problem: Fit a Gaussian mixture model to the data (assume a certain number of Gaussian components). Try different numbers of components.\n", "\n", "Hint: The `GaussianMixture` class may prove useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture\n", "\n", "clf = GaussianMixture(n_components = # complete\n", "clf.fit( # complete\n", " \n", "plt.figure()\n", "plt.scatter( # complete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hierarchical clustering\n", "\n", "Hierarchical clustering algorithms are an alternative to partition clustering ones, such as $k$-means and DBSCAN. \n", "\n", "Problem: Fit a hierarchical clustering algorithm to the data. How does it compare with $k$-means or DBSCAN?\n", "\n", "Hint: The `AgglomerativeClustering` class may prove useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.clustering import AgglomerativeClustering\n", "\n", "agg = AgglomerativeClustering( # complete\n", "agg.fit_predict( # complete\n", " \n", "plt.figure()\n", "plt.scatter( # complete\n", "plt.scatter( # complete\n", "plt.xlabel('u - g')\n", "plt.ylabel('g - r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Density estimation\n", "\n", "It's often unclear from simple scatter plots what the probability density function (pdf) for a data set actually looks like; density estimation algorithms attempt to recover it. Many of the basic aspects of density estimation were covered in the Visualization talk from Session 1 (Day 3) so we're going to focus more on some operational details.\n", "\n", "### Fine tuning histograms\n", "\n", "By far the simplest density estimation algorithm is the histogram and several prescriptions exist for the optimal bin width/number of bins to use. \n", "\n", "Problem: Using the SDSS galaxy data set, SDSS_gals.csv, show the redshift distribution with Scott's rule, Freedman-Diaconis rule, Knuth rule, and Bayesian blocks.\n", "\n", "Hint - The `hist` class in the module `astroML.plotting` may prove useful for Knuth rule and Bayesian blocks. `astroML` goes with the book by Ivezic, Connolly, Gray and VanderPlas and provides additional machine learning material for astronomy." 
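] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of how `astroML`'s `hist` accepts a bin-rule keyword (shown on a synthetic Gaussian sample, not the SDSS redshifts): the `bins` argument can be `'scott'`, `'freedman'`, `'knuth'`, or `'blocks'`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Illustrative sketch of the astroML hist() bin rules on a synthetic Gaussian sample\n", "from astroML.plotting import hist as aml_hist\n", "\n", "x_demo = np.random.normal(0, 1, 1000)\n", "\n", "fig, axes = plt.subplots(1, 4, figsize=(17, 5))\n", "for ax, rule in zip(axes, ['scott', 'freedman', 'knuth', 'blocks']):\n", "    aml_hist(x_demo, bins=rule, ax=ax, histtype='stepfilled', alpha=0.5)\n", "    ax.set_title(rule)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The cell below applies the same four rules to the SDSS galaxy redshifts."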
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from astroML.plotting import hist\n", "\n", "df = pd.read_csv(\"SDSS_gals.csv\")\n", "z = df # complete\n", "\n", "fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize = (17,5))\n", "\n", "hist(z # complete \n", "hist(z # complete \n", "hist(z # complete \n", "hist(z # complete " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Kernel density estimation\n", "\n", "Kernel density estimation fits a kernel to each data point and then estimates the pdf from the sum of all of these. \n", "\n", "Problem: Fit the data with a standard KDE. How des it compare against histograms?\n", "\n", "Hint - The `KernelDensity` class in the `sklearn.neighbors` module may be useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KernelDensity\n", "\n", "kde = KernelDensity( # complete\n", " \n", "# complete\n", " \n", "plt.figure()\n", "plt.scatter( # complete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The optimal bandwidth parameter, $h$, can be determined by cross-validation. \n", "\n", "Problem: Try it for the data set over $log(h)$ in the range [-3, 0]:\n", "\n", "Hint - The `GridSearchCV` class in the `sklearn.grid_search` module may be useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV\n", "\n", "# use grid search cross-validation to optimize the bandwidth\n", "params = {'bandwidth': # complete\n", "grid = GridSearchCV( # complete\n", "# complete\n", " \n", "# Extract the scores and plot them\n", "# complete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we've just been looking at one dimension but now let's consider a two-dimensional case: the density of galaxies in terms of their radius (`petroRad_r_kpc`) and magnitude (`absPetroMag_r`). \n", "\n", "Problem: Repeat finding the optimal bandwidth via cross-validation and plot the resulting pdf. \n", "\n", "Note: The method `score_samples` gives results as $\\log$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = # complete\n", "\n", "# use grid search cross-validation to optimize the bandwidth\n", "params = {'bandwidth': # complete\n", "grid = GridSearchCV( # complete\n", "# complete\n", "\n", "x = np.arange(.0, 40, 1)\n", "y = np.arange(-24, -15, 0.2)\n", "X, Y = np.meshgrid(x, y)\n", "XY = np.array([X.flatten(), Y.flatten()]).transpose()\n", "pdf = # complete\n", " \n", "# Use imshow to make a color density plot\n", "plt.imshow(# complete, interpolation='none', cmap=pl.cm.jet, origin='lower',clip_on=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dimensional reduction\n", "\n", "There are a number of unsupervised learning algorithms that can produce a low-dimensional representation of a high-dimensional data set that preserves topological information, i.e., clustering. The state-of-the-art is the t-distributed stochastic neighborhood embedding (t-SNE). \n", "\n", "Problem: Fit the data set with a t-SNE (this may take some time as the algorithm is computationally intensive)\n", "\n", "Note: The `TSNE` class within the `sklearn.manifold` modules uses a slightly different syntax from the other `scikit-learn` implementations. The call is `.fit_transform(X)`." 
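] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of the `TSNE` call on synthetic data (not the SDSS set): `.fit_transform()` returns the 2D embedding directly, and coloring it by a $k$-means label helps show how clusters map into the embedded space." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Illustrative t-SNE sketch on synthetic data, not the SDSS set\n", "from sklearn.manifold import TSNE\n", "from sklearn.cluster import KMeans\n", "from sklearn.datasets import make_blobs\n", "\n", "X_demo, _ = make_blobs(n_samples=500, n_features=5, centers=3, random_state=0)\n", "\n", "embedding = TSNE(n_components=2, random_state=0).fit_transform(X_demo)\n", "labels = KMeans(n_clusters=3).fit(X_demo).labels_\n", "\n", "plt.figure()\n", "plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5)\n", "plt.xlabel('t-SNE dimension 1')\n", "plt.ylabel('t-SNE dimension 2')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now try the same `.fit_transform()` call on (a subset of) the SDSS data in the cell below."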
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "\n", "tsne = TSNE( # complete\n", "# complete\n", " \n", "plt.figure()\n", "plt.scatter( # complete" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }