# K Means Clustering

library(swirl)

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as

| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis

2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

```
1: Principles of Analytic Graphs 2: Exploratory Graphs
3: Graphics Devices in R 4: Plotting Systems
5: Base Plotting System 6: Lattice Plotting System
7: Working with Colors 8: GGPlot2 Part1
9: GGPlot2 Part2 10: GGPlot2 Extras
11: Hierarchical Clustering 12: K Means Clustering
13: Dimension Reduction 14: Clustering Example
15: CaseStudy
```

Selection: 12

| Attempting to load lesson dependencies...

| Package ‘ggplot2’ loaded correctly!

| Package ‘fields’ loaded correctly!

| Package ‘jpeg’ loaded correctly!

| Package ‘datasets’ loaded correctly!

| | 0%

| K_Means_Clustering. (Slides for this and other Data Science courses may be found at

| github https://github.com/DataScienceSpecialization/courses/. If you care to use

| them, they must be downloaded as a zip file and viewed locally. This lesson

| corresponds to 04_ExploratoryAnalysis/kmeansClustering.)

...

|== | 2%

| In this lesson we'll learn about k-means clustering, another simple way of examining

| and organizing multi-dimensional data. As with hierarchical clustering, this

| technique is most useful in the early stages of analysis when you're trying to get

| an understanding of the data, e.g., finding some pattern or relationship between

| different factors or variables.

...

|=== | 4%

| R documentation tells us that the k-means method "aims to partition the points into

| k groups such that the sum of squares from points to the assigned cluster centres is

| minimized."

...

|===== | 6%

| Since clustering organizes data points that are close into groups we'll assume we've

| decided on a measure of distance, e.g., Euclidean.

...

|====== | 8%

| To illustrate the method, we'll use these random points we generated, familiar to

| you if you've already gone through the hierarchical clustering lesson. We'll

| demonstrate k-means clustering in several steps, but first we'll explain the general

| idea.

...

|======== | 10%

| As we said, k-means is a partitioning approach which requires that you first guess how

| many clusters you have (or want). Once you fix this number, you randomly create a

| "centroid" (a phantom point) for each cluster and assign each point or observation

| in your dataset to the centroid to which it is closest. Once each point is assigned

| a centroid, you readjust the centroid's position by making it the average of the

| points assigned to it.

...

|========= | 12%

| Once you have repositioned the centroids, you must recalculate the distance of the

| observations to the centroids and reassign any, if necessary, to the centroid

| closest to them. Again, once the reassignments are done, readjust the positions of

| the centroids based on the new cluster membership. The process stops once you reach

| an iteration in which no adjustments are made or when you've reached some

| predetermined maximum number of iterations.

...
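Pulling that description together, here is a minimal R sketch of one k-means run (hypothetical helper code, not part of the lesson; it assumes x and y hold the data coordinates, cx and cy hold the current centroid guesses, and every cluster keeps at least one point):

```r
# Hypothetical sketch of the loop described above (not part of the lesson).
# Assumes x, y are the data coordinates and cx, cy the initial centroid guesses.
for (iter in 1:10) {                                  # predetermined maximum iterations
    # distance from every centroid (rows) to every point (columns)
    d <- sapply(seq_along(x), function(j) sqrt((x[j] - cx)^2 + (y[j] - cy)^2))
    clust <- apply(d, 2, which.min)                   # assign each point to its nearest centroid
    newCx <- tapply(x, clust, mean)                   # recompute each centroid as the
    newCy <- tapply(y, clust, mean)                   # average of its assigned points
    if (all(newCx == cx) && all(newCy == cy)) break   # stop when no centroid moves
    cx <- newCx
    cy <- newCy
}
```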

|=========== | 14%

| As described, what does this process require?

1: All of the others

2: A number of clusters

3: A defined distance metric

4: An initial guess as to cluster centroids

Selection: 1

| That's the answer I was looking for.

|============ | 16%

| So k-means clustering requires some distance metric (say Euclidean), a hypothesized

| fixed number of clusters, and an initial guess as to cluster centroids. As

| described, what does this process produce?

1: All of the others

2: An assignment of each point to a cluster

3: A final estimate of cluster centroids

Selection: 1

| You nailed it! Good job!

|============== | 18%

| When it's finished k-means clustering returns a final position of each cluster's

| centroid as well as the assignment of each data point or observation to a cluster.

...

|=============== | 20%

| Now we'll step through this process using our random points as our data. The

| coordinates of these are stored in 2 vectors, x and y. We eyeball the display and

| guess that there are 3 clusters. We'll pick 3 positions of centroids, one for each

| cluster.

...

|================= | 22%

| We've created two 3-long vectors for you, cx and cy. These respectively hold the x-

| and y- coordinates for 3 proposed centroids. For convenience, we've also stored them

| in a 2 by 3 matrix cmat. The x coordinates are in the first row and the y

| coordinates in the second. Look at cmat now.

cmat

```
     [,1] [,2] [,3]
[1,]    1  1.8  2.5
[2,]    2  1.0  1.5
```

| Excellent work!

|================== | 24%

| The coordinates of these points are (1,2), (1.8,1) and (2.5,1.5). We'll add these

| centroids to the plot of our points. Do this by calling the R command points with 6

| arguments. The first 2 are cx and cy, and the third is col set equal to the

| concatenation of 3 colors, "red", "orange", and "purple". The fourth argument is pch

| set equal to 3 (a plus sign), the fifth is cex set equal to 2 (expansion of

| character), and the final is lwd (line width) also set equal to 2.

points(cx,cy,col=c("red","orange","purple"),pch=3,cex=2,lwd=2)

| You nailed it! Good job!

|==================== | 26%

| We see the first centroid (1,2) is in red. The second (1.8,1), to the right and

| below the first, is orange, and the final centroid (2.5,1.5), the furthest to the

| right, is purple.

...

|====================== | 28%

| Now we have to calculate distances between each point and every centroid. There are

| 12 data points and 3 centroids. How many distances do we have to calculate?

1: 15

2: 108

3: 36

4: 9

Selection: 3

| You are amazing!

|======================= | 30%

| We've written a function for you called mdist which takes 4 arguments. The vectors

| of data points (x and y) are the first two and the two vectors of centroid

| coordinates (cx and cy) are the last two. Call mdist now with these arguments.

mdist(x,y,cx,cy)

```
         [,1]      [,2]      [,3]     [,4]      [,5]      [,6]      [,7]     [,8]
[1,] 1.392885 0.9774614 0.7000680 1.264693 1.1894610 1.2458771 0.8113513 1.026750
[2,] 1.108644 0.5544675 0.3768445 1.611202 0.8877373 0.7594611 0.7003994 2.208006
[3,] 3.461873 2.3238956 1.7413021 4.150054 0.3297843 0.2600045 0.4887610 1.337896
          [,9]     [,10]     [,11]     [,12]
[1,] 4.5082665 4.5255617 4.8113368 4.0657750
[2,] 1.1825265 1.0540994 1.2278193 1.0090944
[3,] 0.3737554 0.4614472 0.5095428 0.2567247
```

| Excellent job!
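The lesson doesn't show how mdist is implemented; a minimal sketch of a function with the same behavior (my assumption, using outer) might look like this:

```r
# Sketch of a possible mdist (assumption; the lesson's own code isn't shown).
# Entry [i, j] is the Euclidean distance from centroid i to data point j,
# so with 3 centroids and 12 points the result is a 3 x 12 matrix.
mdist <- function(x, y, cx, cy) {
    outer(seq_along(cx), seq_along(x),
          function(i, j) sqrt((x[j] - cx[i])^2 + (y[j] - cy[i])^2))
}
```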

|========================= | 32%

| We've stored these distances in the matrix distTmp for you. Now we have to assign a

| cluster to each point. To do that we'll look at each column and ?

1: add up the 3 entries.

2: pick the minimum entry

3: pick the maximum entry

Selection: 2

| Perseverance, that's the answer.

|========================== | 34%

| From the distTmp entries, which cluster would point 6 be assigned to?

1: none of the above

2: 3

3: 2

4: 1

Selection: 2

| Keep working like that and you'll get there!

|============================ | 36%

| R has a handy function which.min which you can apply to ALL the columns of distTmp

| with one call. Simply call the R function apply with 3 arguments. The first is

| distTmp, the second is 2 meaning the columns of distTmp, and the third is which.min,

| the function you want to apply to the columns of distTmp. Try this now.

apply(distTmp,2,which.min)

[1] 2 2 2 1 3 3 3 1 3 3 3 3

| You are really on a roll!
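As an aside, which.min just returns the index of a vector's smallest element; applied to the first column of distTmp it picks row 2:

```r
which.min(c(1.392885, 1.108644, 3.461873))   # first column of distTmp
# [1] 2
```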

|============================= | 38%

| You can see that you were right and the 6th entry is indeed 3 as you answered

| before. We see the first 3 entries were assigned to the second (orange) cluster and

| only 2 points (4 and 8) were assigned to the first (red) cluster.

...

|=============================== | 40%

| We've stored the vector of cluster colors ("red","orange","purple") in the array

| cols1 for you and we've also stored the cluster assignments in the array newClust.

| Let's color the 12 data points according to their assignments. Again, use the

| command points with 5 arguments. The first 2 are x and y. The third is pch set to

| 19, the fourth is cex set to 2, and the last, col is set to cols1[newClust].

points(x,y,pch=19,cex=2,col=cols1[newClust])

| Keep up the great work!

|================================ | 42%

| Now we have to recalculate our centroids so they are the average (center of gravity)

| of the cluster of points assigned to them. We have to do the x and y coordinates

| separately. We'll do the x coordinate first. Recall that the vectors x and y hold

| the respective coordinates of our 12 data points.

...

|================================== | 44%

| We can use the R function tapply which applies "a function over a ragged array".

| This means that every element of the array is assigned a factor and the function is

| applied to subsets of the array (identified by the factor vector). This allows us to

| take advantage of the factor vector newClust we calculated. Call tapply now with 3

| arguments, x (the data), newClust (the factor array), and mean (the function to

| apply).

tapply(x,newClust,mean)

```
       1        2        3 
1.210767 1.010320 2.498011 
```

| Perseverance, that's the answer.
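If tapply is new to you, here is a tiny standalone illustration (toy data, not from the lesson) of grouping one vector by a factor and averaging within each group:

```r
vals   <- c(1, 2, 3, 10, 20, 30)
groups <- factor(c("a", "a", "a", "b", "b", "b"))
tapply(vals, groups, mean)   # mean of vals within each level of groups
#  a  b 
#  2 20 
```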

|=================================== | 46%

| Repeat the call, except now apply it to the vector y instead of x.

tapply(y,newClust,mean)

```
       1        2        3 
1.730555 1.016513 1.354373 
```

| Your dedication is inspiring!

|===================================== | 48%

| Now that we have new x and new y coordinates for the 3 centroids we can plot them.

| We've stored off the coordinates for you in variables newCx and newCy. Use the R

| command points with these as the first 2 arguments. In addition, use the arguments

| col set equal to cols1, pch equal to 8, cex equal to 2 and lwd also equal to 2.

points(newCx,newCy,col=cols1,pch=8,cex=2,lwd=2)

| Keep up the great work!

|====================================== | 50%

| We see how the centroids have moved closer to their respective clusters. This is

| especially true of the second (orange) cluster. Now call the distance function mdist

| with the 4 arguments x, y, newCx, and newCy. This will allow us to reassign the data

| points to new clusters if necessary.

mdist(x,y,newCx,newCy)

```
           [,1]        [,2]      [,3]      [,4]      [,5]      [,6]      [,7]     [,8]
[1,] 0.98911875 0.539152725 0.2901879 1.0286979 0.7936966 0.8004956 0.4650664 1.028698
[2,] 0.09287262 0.002053041 0.0734304 0.2313694 1.9333732 1.8320407 1.4310971 2.926095
[3,] 3.28531180 2.197487387 1.6676725 4.0113796 0.4652075 0.3721778 0.6043861 1.643033
          [,9]    [,10]     [,11]     [,12]
[1,] 3.3053706 3.282778 3.5391512 2.9345445
[2,] 3.5224442 3.295301 3.5990955 3.2097944
[3,] 0.2586908 0.309730 0.3610747 0.1602755
```

| Excellent work!

|======================================== | 52%

| We've stored off this new matrix of distances in the matrix distTmp2 for you. Recall

| that the first cluster is red, the second orange and the third purple. Look closely

| at columns 4 and 7 of distTmp2. What will happen to points 4 and 7?

1: They will both change to cluster 2

2: Nothing

3: They're the only points that won't change clusters

4: They will both change clusters

Selection: 4

| You nailed it! Good job!

|========================================== | 54%

| Now call apply with 3 arguments, distTmp2, 2, and which.min to find the new cluster

| assignments for the points.

apply(distTmp2,2,which.min)

[1] 2 2 2 2 3 3 1 1 3 3 3 3

| That's a job well done!

|=========================================== | 56%

| We've stored off the new cluster assignments in a vector of factors called

| newClust2. Use the R function points to recolor the points with their new

| assignments. Again, there are 5 arguments, x and y are first, followed by pch set to

| 19, cex to 2, and col to cols1[newClust2].

points(x,y,pch=19,cex=2,col=cols1[newClust2])

| Keep working like that and you'll get there!

|============================================= | 58%

| Notice that points 4 and 7 both changed clusters, 4 moved from 1 to 2 (red to

| orange), and point 7 switched from 3 to 1 (purple to red).

...

|============================================== | 60%

| Now use tapply to find the x coordinate of the new centroid. Recall there are 3

| arguments, x, newClust2, and mean.

tapply(x,newClust2,mean)

```
        1         2         3 
1.8878628 0.8904553 2.6001704 
```

| You're the best!

|================================================ | 62%

| Do the same to find the new y coordinate.

tapply(y,newClust2,mean)

```
       1        2        3 
2.157866 1.006871 1.274675 
```

| Excellent work!

|================================================= | 64%

| We've stored off these coordinates for you in the variables finalCx and finalCy.

| Plot these new centroids using the points function with 6 arguments. The first 2 are

| finalCx and finalCy. The argument col should equal cols1, pch should equal 9, cex 2

| and lwd 2.

points(finalCx,finalCy,col=cols1,pch=9,cex=2,lwd=2)

| You nailed it! Good job!

|=================================================== | 66%

| It should be obvious that if we continued this process points 5 through 8 would all

| turn red, while points 1 through 4 stay orange, and points 9 through 12 purple.

...

|==================================================== | 68%

| Now that you've gone through an example step by step, you'll be relieved to hear

| that R provides a command to do all this work for you. Unsurprisingly it's called

| kmeans and, although it has several parameters, we'll just mention four. These are

| x, (the numeric matrix of data), centers, iter.max, and nstart. The second of these

| (centers) can be either a number of clusters or a set of initial centroids. The

| third, iter.max, specifies the maximum number of iterations to go through, and

| nstart is the number of random starts you want to try if you specify centers as a

| number.

...
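As a sketch of how those arguments fit together (the object name fit is just for illustration; dataFrame is the lesson's data):

```r
# Illustrative call (assumes dataFrame as in the lesson): 3 clusters,
# at most 20 iterations, best of 25 random starts.
fit <- kmeans(dataFrame, centers = 3, iter.max = 20, nstart = 25)
fit$centers        # final centroid coordinates
fit$cluster        # cluster assignment for each observation
fit$tot.withinss   # total within-cluster sum of squares
```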

|====================================================== | 70%

| Call kmeans now with 2 arguments, dataFrame (which holds the x and y coordinates of

| our 12 points) and centers set equal to 3.

kmeans(dataFrame,centers=3)

```
K-means clustering with 3 clusters of sizes 4, 4, 4

Cluster means:
          x         y
1 0.8904553 1.0068707
2 2.8534966 0.9831222
3 1.9906904 2.0078229

Clustering vector:
 [1] 1 1 1 1 3 3 3 3 2 2 2 2

Within cluster sum of squares by cluster:
[1] 0.34188313 0.03298027 0.34732441
 (between_SS / total_SS =  93.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
```

| You are really on a roll!

|======================================================= | 72%

| The program returns the information that the data clustered into 3 clusters each of

| size 4. It also returns the coordinates of the 3 cluster means, a vector named

| cluster indicating how the 12 points were partitioned into the clusters, and the sum

| of squares within each cluster. It also shows all the available components returned

| by the function. We've stored off this data for you in a kmeans object called kmObj.

| Look at kmObj$iter to see how many iterations the algorithm went through.

kmObj$iter

[1] 1

| You got it!

|========================================================= | 74%

| Two iterations as we did before. We just want to emphasize how you can access the

| information available to you. Let's plot the data points color coded according to

| their cluster. This was stored in kmObj$cluster. Run plot with 5 arguments. The

| data, x and y, are the first two; the third, col is set equal to kmObj$cluster, and

| the last two are pch and cex. The first of these should be set to 19 and the last to

| 2.

plot(x,y,col=kmObj$cluster,pch=19,cex=2)

| You are doing so well!

|=========================================================== | 76%

| Now add the centroids which are stored in kmObj$centers. Use the points function

| with 5 arguments. The first two are kmObj$centers and col=c("black","red","green").

| The last three, pch, cex, and lwd, should all equal 3.

points(kmObj$centers,col=c("black","red","green"),pch=3,cex=3,lwd=3)

| Excellent work!

|============================================================ | 78%

| Now for some fun! We want to show you how the output of the kmeans function is

| affected by its random start (when you just ask for a number of clusters). With

| random starts you might want to run the function several times to get an idea of the

| relationships between your observations. We'll call kmeans with the same data points

| (stored in dataFrame), but ask for 6 clusters instead of 3.

...

|============================================================== | 80%

| We'll plot our data points several times and each time we'll just change the

| argument col which will show us how the R function kmeans is clustering them. So,

| call plot now with 5 arguments. The first 2 are x and y. The third is col set equal

| to the call kmeans(dataFrame,6)$cluster. The last two (pch and cex) are set to 19

| and 2 respectively.

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

| You nailed it! Good job!

|=============================================================== | 82%

| See how the points cluster? Now recall your last command and rerun it.

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

| Nice work!

|================================================================= | 84%

| See how the clustering has changed? As the Teletubbies would say, "Again! Again!"

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

| That's the answer I was looking for.
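If you want a particular run to be reproducible, or want kmeans to try several random starts itself and keep only the best one, something like this works (the seed value is arbitrary):

```r
set.seed(1234)                                       # arbitrary seed for reproducibility
best <- kmeans(dataFrame, centers = 6, nstart = 25)  # keep the best of 25 random starts
plot(x, y, col = best$cluster, pch = 19, cex = 2)
```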

|================================================================== | 86%

| So the clustering changes with different starts. Perhaps 6 is too many clusters?

| Let's review!

...

|==================================================================== | 88%

| True or False? K-means clustering requires you to specify a number of clusters

| before you begin.

1: False

2: True

Selection: 2

| You nailed it! Good job!

|===================================================================== | 90%

| True or False? K-means clustering requires you to specify a number of iterations

| before you begin.

1: True

2: False

Selection: 2

| You got it right!

|======================================================================= | 92%

| True or False? Every data set has a single fixed number of clusters.

1: False

2: True

Selection: 1

| Excellent job!

|======================================================================== | 94%

| True or False? K-means clustering will always stop in 3 iterations

1: True

2: False

Selection: 2

| You are really on a roll!

|========================================================================== | 96%

| True or False? When starting kmeans with random centroids, you'll always end up with

| the same final clustering.

1: False

2: True

Selection: 1

| Great job!

|=========================================================================== | 98%

| Congratulations! We hope this means you found this lesson oK.

...

|=============================================================================| 100%

| Would you like to receive credit for completing this course on Coursera.org?

1: No

2: Yes

Selection: 2

What is your email address? [email protected]

What is your assignment token? xXxXxxXXxXxxXXXx

Grade submission succeeded!

| Excellent work!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis

2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

*Last updated 2020-05-09 20:59:57.636628 IST*
