# Statistical Inference - Introduction

R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

library(swirl)

| Hi! Type swirl() when you are ready to begin.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Statistical Inference
3: Take me to the swirl course repository!

Selection: 2

 1: Introduction             2: Probability1             3: Probability2
 4: ConditionalProbability   5: Expectations             6: Variance
 7: CommonDistros            8: Asymptotics              9: T Confidence Intervals
10: Hypothesis Testing      11: P Values                12: Power
13: Multiple Testing        14: Resampling

Selection: 1

| | 0%

| Introduction to Statistical_Inference. (Slides for this and other Data Science
| courses may be found at github https://github.com/DataScienceSpecialization/courses.
| If you care to use them, they must be downloaded as a zip file and viewed locally.
| This lesson corresponds to Statistical_Inference/Introduction.)

...

|======== | 10%
| In this lesson, we'll briefly introduce basics of statistical inference, the process
| of drawing conclusions "about a population using noisy statistical data where
| uncertainty must be accounted for". In other words, statistical inference lets
| scientists formulate conclusions from data and quantify the uncertainty arising from
| using incomplete data.

...
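
As a rough illustration of "quantifying the uncertainty", the sketch below builds an approximate 95% interval for a poll result. The poll numbers are made up for the example, and the normal approximation is an assumption rather than anything the lesson prescribes.

```r
# Hypothetical poll: 520 of 1000 respondents favor a candidate (invented numbers).
n <- 1000
favor <- 520

p_hat <- favor / n                          # point estimate of the proportion
se <- sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se  # approximate 95% interval (normal approximation)

round(ci, 3)
```

The estimate alone says the candidate leads; the interval shows whether the sample is large enough to support that conclusion.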

|=============== | 20%
| Which of the following is NOT an example of statistical inference?

1: Polling before an election to predict its outcome
2: Recording the results of a statistics exam
3: Testing the efficacy of a new drug
4: Constructing a medical image from fMRI data

Selection: 2

| All that hard work is paying off!

|======================= | 30%
| So statistical inference involves formulating conclusions using data AND quantifying
| the uncertainty associated with those conclusions. The uncertainty could arise from

...

|=============================== | 40%
| Which of the following would NOT be a source of bad data?

1: Small sample size
2: Selection bias
3: A randomly selected sample of population
4: A poorly designed study

Selection: 3

|====================================== | 50%
| So with statistical inference we use data to draw general conclusions about a
| population. Which of the following would a scientist using statistical inference
| techniques consider a problem?

1: Our study has no bias and is well-designed
2: Our data sample is representative of the population
3: Contaminated data

Selection: 3

| That's a job well done!

|============================================== | 60%
| Which of the following is NOT an example of statistical inference in action?

1: Testing the effectiveness of a medical treatment
2: Estimating the proportion of people who will vote for a candidate
3: Determining a causative mechanism underlying a disease
4: Counting sheep

Selection: 4

| You got it right!

|====================================================== | 70%
| We want to emphasize a couple of important points here. First, a statistic
| (singular) is a number computed from a sample of data. We use statistics to infer
| information about a population. Second, a random variable is an outcome from an
| experiment. Deterministic processes, such as computing means or variances, applied
| to random variables, produce additional random variables which have their own
| distributions. It's important to keep straight which distributions you're talking

...
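
The point that a statistic computed from random data is itself a random variable can be checked with a small simulation. This is only a sketch: the population (standard normal), the sample size, and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(42)        # arbitrary seed, only for reproducibility
n_samples <- 1000   # number of repeated experiments (assumed)
n <- 25             # observations per experiment (assumed)

# Each column is one sample of size n from a standard normal population.
samples <- matrix(rnorm(n * n_samples), nrow = n)

# One statistic (the mean) per sample.
sample_means <- colMeans(samples)

# The means have their own distribution, narrower than the population's:
# theory suggests a standard deviation of about 1 / sqrt(n) = 0.2.
sd_of_means <- sd(sample_means)
sd_of_means
```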

|============================================================== | 80%
| Finally, there are two broad flavors of inference. The first is frequency, which
| uses "long run proportion of times an event occurs in independent, identically
| distributed repetitions." The second is Bayesian in which the probability estimate
| for a hypothesis is updated as additional evidence is acquired. Both flavors require
| an understanding of probability so that's what the next lessons will cover.

...
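
The two flavors can be contrasted on simulated data. In this sketch the true probability `p_true` and the uniform Beta(1, 1) prior are assumptions made for the example; the lesson does not specify either.

```r
set.seed(1)
p_true <- 0.3                                   # assumed true probability of the event
flips <- rbinom(10000, size = 1, prob = p_true)  # independent, identically distributed trials

# Frequency view: running proportion of times the event occurred.
running_prop <- cumsum(flips) / seq_along(flips)

# Bayesian view: start from a uniform Beta(1, 1) prior and update it with
# the observed successes and failures (conjugate Beta-Binomial update).
successes <- sum(flips)
failures <- length(flips) - successes
posterior_mean <- (1 + successes) / (2 + successes + failures)

running_prop[c(10, 100, 10000)]   # the proportion settles as trials accumulate
posterior_mean
```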

|===================================================================== | 90%
| Congrats! You've concluded this brief introduction to statistical inference.

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: No
2: Yes

Selection: 2
What is your assignment token? xXxXxxXXxXxxXXXx

| Great job!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Statistical Inference
3: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

Last updated 2020-05-26 15:45:29.101530 IST

# K Means Clustering

library(swirl)
swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

 1: Principles of Analytic Graphs   2: Exploratory Graphs
 3: Graphics Devices in R           4: Plotting Systems
 5: Base Plotting System            6: Lattice Plotting System
 7: Working with Colors             8: GGPlot2 Part1
 9: GGPlot2 Part2                  10: GGPlot2 Extras
11: Hierarchical Clustering        12: K Means Clustering
13: Dimension Reduction            14: Clustering Example
15: CaseStudy

Selection: 12

| Attempting to load lesson dependencies...

| | 0%

| K_Means_Clustering. (Slides for this and other Data Science courses may be found at
| github https://github.com/DataScienceSpecialization/courses/. If you care to use
| them, they must be downloaded as a zip file and viewed locally. This lesson
| corresponds to 04_ExploratoryAnalysis/kmeansClustering.)

...

|== | 2%
| In this lesson we'll learn about k-means clustering, another simple way of examining
| and organizing multi-dimensional data. As with hierarchical clustering, this
| technique is most useful in the early stages of analysis when you're trying to get
| an understanding of the data, e.g., finding some pattern or relationship between
| different factors or variables.

...

|=== | 4%
| R documentation tells us that the k-means method "aims to partition the points into
| k groups such that the sum of squares from points to the assigned cluster centres is
| minimized."

...

|===== | 6%
| Since clustering organizes data points that are close into groups we'll assume we've
| decided on a measure of distance, e.g., Euclidean.

...

|====== | 8%
| To illustrate the method, we'll use these random points we generated, familiar to
| you if you've already gone through the hierarchical clustering lesson. We'll
| demonstrate k-means clustering in several steps, but first we'll explain the general
| idea.

...

|======== | 10%
| As we said, k-means is a partitioning approach which requires that you first guess how
| many clusters you have (or want). Once you fix this number, you randomly create a
| "centroid" (a phantom point) for each cluster and assign each point or observation
| in your dataset to the centroid to which it is closest. Once each point is assigned
| a centroid, you readjust the centroid's position by making it the average of the
| points assigned to it.

...

|========= | 12%
| Once you have repositioned the centroids, you must recalculate the distance of the
| observations to the centroids and reassign any, if necessary, to the centroid
| closest to them. Again, once the reassignments are done, readjust the positions of
| the centroids based on the new cluster membership. The process stops once you reach
| an iteration in which no adjustments are made or when you've reached some
| predetermined maximum number of iterations.

...
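
The assign-then-recenter loop described above can be sketched in a few lines of R. `simple_kmeans`, `pts`, and the starting centroids are hypothetical names and values introduced for this illustration; this is not the code swirl or `kmeans` runs internally.

```r
# A minimal sketch of the iteration: assign points, move centroids, repeat.
simple_kmeans <- function(pts, centers, max_iter = 10) {
  assignment <- rep(0L, nrow(pts))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point (row) to every centroid (column).
    d <- sapply(seq_len(nrow(centers)), function(k) {
      rowSums((pts - matrix(centers[k, ], nrow(pts), ncol(pts), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-d, ties.method = "first")  # nearest centroid per point
    if (all(new_assignment == assignment)) break           # no reassignment: stop
    assignment <- new_assignment
    # Move each centroid to the mean of the points assigned to it.
    for (k in seq_len(nrow(centers))) {
      if (any(assignment == k)) {
        centers[k, ] <- colMeans(pts[assignment == k, , drop = FALSE])
      }
    }
  }
  list(cluster = assignment, centers = centers, iter = iter)
}

# Made-up data: two well-separated groups of 10 points each.
set.seed(7)
pts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2)
)
fit <- simple_kmeans(pts, centers = rbind(c(0.5, 0.5), c(2.5, 2.5)))
fit$cluster
```

The loop ends either when an iteration changes no assignment or when `max_iter` is reached, matching the two stopping conditions in the text.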

|=========== | 14%
| As described, what does this process require?

1: All of the others
2: A number of clusters
3: A defined distance metric
4: An initial guess as to cluster centroids

Selection: 1

| That's the answer I was looking for.

|============ | 16%
| So k-means clustering requires some distance metric (say Euclidean), a hypothesized
| fixed number of clusters, and an initial guess as to cluster centroids. As
| described, what does this process produce?

1: All of the others
2: An assignment of each point to a cluster
3: A final estimate of cluster centroids

Selection: 1

| You nailed it! Good job!

|============== | 18%
| When it's finished k-means clustering returns a final position of each cluster's
| centroid as well as the assignment of each data point or observation to a cluster.

...

|=============== | 20%
| Now we'll step through this process using our random points as our data. The
| coordinates of these are stored in 2 vectors, x and y. We eyeball the display and
| guess that there are 3 clusters. We'll pick 3 positions of centroids, one for each
| cluster.

...

|================= | 22%
| We've created two 3-long vectors for you, cx and cy. These respectively hold the x-
| and y- coordinates for 3 proposed centroids. For convenience, we've also stored them
| in a 2 by 3 matrix cmat. The x coordinates are in the first row and the y
| coordinates in the second. Look at cmat now.

cmat

     [,1] [,2] [,3]
[1,]    1  1.8  2.5
[2,]    2  1.0  1.5

| Excellent work!

|================== | 24%
| The coordinates of these points are (1,2), (1.8,1) and (2.5,1.5). We'll add these
| centroids to the plot of our points. Do this by calling the R command points with 6
| arguments. The first 2 are cx and cy, and the third is col set equal to the
| concatenation of 3 colors, "red", "orange", and "purple". The fourth argument is pch
| set equal to 3 (a plus sign), the fifth is cex set equal to 2 (expansion of
| character), and the final is lwd (line width) also set equal to 2.

points(cx,cy,col=c("red","orange","purple"),pch=3,cex=2,lwd=2)

| You nailed it! Good job!

|==================== | 26%
| We see the first centroid (1,2) is in red. The second (1.8,1), to the right and
| below the first, is orange, and the final centroid (2.5,1.5), the furthest to the
| right, is purple.

...

|====================== | 28%
| Now we have to calculate distances between each point and every centroid. There are
| 12 data points and 3 centroids. How many distances do we have to calculate?

1: 15
2: 108
3: 36
4: 9

Selection: 3

| You are amazing!

|======================= | 30%
| We've written a function for you called mdist which takes 4 arguments. The vectors
| of data points (x and y) are the first two and the two vectors of centroid
| coordinates (cx and cy) are the last two. Call mdist now with these arguments.

mdist(x,y,cx,cy)

         [,1]      [,2]      [,3]     [,4]      [,5]      [,6]      [,7]     [,8]
[1,] 1.392885 0.9774614 0.7000680 1.264693 1.1894610 1.2458771 0.8113513 1.026750
[2,] 1.108644 0.5544675 0.3768445 1.611202 0.8877373 0.7594611 0.7003994 2.208006
[3,] 3.461873 2.3238956 1.7413021 4.150054 0.3297843 0.2600045 0.4887610 1.337896
          [,9]     [,10]     [,11]     [,12]
[1,] 4.5082665 4.5255617 4.8113368 4.0657750
[2,] 1.1825265 1.0540994 1.2278193 1.0090944
[3,] 0.3737554 0.4614472 0.5095428 0.2567247

| Excellent job!
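
The source of `mdist` is not shown in this transcript. The sketch below is a hypothetical stand-in, `my_mdist`, that returns the same shape of result (one row per centroid, one column per point). Using squared Euclidean distance is an assumption; squaring does not change which centroid is nearest.

```r
# Hypothetical stand-in for mdist; not swirl's actual implementation.
my_mdist <- function(x, y, cx, cy) {
  d <- matrix(NA_real_, nrow = length(cx), ncol = length(x))
  for (k in seq_along(cx)) {
    d[k, ] <- (x - cx[k])^2 + (y - cy[k])^2   # squared Euclidean distance (assumption)
  }
  d
}

# Centroids from cmat in the lesson; the two data points are made up.
cx <- c(1, 1.8, 2.5)
cy <- c(2, 1, 1.5)
d <- my_mdist(x = c(1, 2.5), y = c(2, 1.5), cx, cy)
dim(d)   # rows are centroids, columns are points
```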

|========================= | 32%
| We've stored these distances in the matrix distTmp for you. Now we have to assign a
| cluster to each point. To do that we'll look at each column and ?

1: add up the 3 entries.
2: pick the minimum entry
3: pick the maximum entry

Selection: 2

|========================== | 34%
| From the distTmp entries, which cluster would point 6 be assigned to?

1: none of the above
2: 3
3: 2
4: 1

Selection: 2

| Keep working like that and you'll get there!

|============================ | 36%
| R has a handy function which.min which you can apply to ALL the columns of distTmp
| with one call. Simply call the R function apply with 3 arguments. The first is
| distTmp, the second is 2 meaning the columns of distTmp, and the third is which.min,
| the function you want to apply to the columns of distTmp. Try this now.

apply(distTmp,2,which.min)
[1] 2 2 2 1 3 3 3 1 3 3 3 3

| You are really on a roll!
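
To see what `apply(..., 2, which.min)` does column by column, the sketch below rebuilds three columns of the distance matrix printed earlier (columns 1, 4, and 6) by hand. The name `dist_cols` is introduced only for this example.

```r
# Columns 1, 4, and 6 of the distance matrix shown above, typed in by hand.
dist_cols <- matrix(c(1.392885,  1.108644,  3.461873,    # point 1
                      1.264693,  1.611202,  4.150054,    # point 4
                      1.2458771, 0.7594611, 0.2600045),  # point 6
                    nrow = 3)

# MARGIN = 2 means "work on columns": for each point, return the row
# (centroid) holding the smallest distance.
nearest <- apply(dist_cols, 2, which.min)
nearest
```

The result agrees with entries 1, 4, and 6 of the full assignment vector in the transcript.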

|============================= | 38%
| You can see that you were right and the 6th entry is indeed 3 as you answered
| before. We see the first 3 entries were assigned to the second (orange) cluster and
| only 2 points (4 and 8) were assigned to the first (red) cluster.

...

|=============================== | 40%
| We've stored the vector of cluster colors ("red","orange","purple") in the array
| cols1 for you and we've also stored the cluster assignments in the array newClust.
| Let's color the 12 data points according to their assignments. Again, use the
| command points with 5 arguments. The first 2 are x and y. The third is pch set to
| 19, the fourth is cex set to 2, and the last, col is set to cols1[newClust].

points(x,y,pch=19,cex=2,col=cols1[newClust])

| Keep up the great work!

|================================ | 42%
| Now we have to recalculate our centroids so they are the average (center of gravity)
| of the cluster of points assigned to them. We have to do the x and y coordinates
| separately. We'll do the x coordinate first. Recall that the vectors x and y hold
| the respective coordinates of our 12 data points.

...

|================================== | 44%
| We can use the R function tapply which applies "a function over a ragged array".
| This means that every element of the array is assigned a factor and the function is
| applied to subsets of the array (identified by the factor vector). This allows us to
| take advantage of the factor vector newClust we calculated. Call tapply now with 3
| arguments, x (the data), newClust (the factor array), and mean (the function to
| apply).

tapply(x,newClust,mean)

       1        2        3
1.210767 1.010320 2.498011

|=================================== | 46%
| Repeat the call, except now apply it to the vector y instead of x.

tapply(y,newClust,mean)

       1        2        3
1.730555 1.016513 1.354373
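
A tiny made-up example may make the "ragged array" idea concrete: `tapply` splits the data by the grouping vector and applies the function to each piece. `x_demo` and `groups` are hypothetical values, not the lesson's data.

```r
x_demo <- c(1.0, 1.2, 3.0, 3.4, 5.0)   # made-up x coordinates
groups <- c(1, 1, 2, 2, 3)             # made-up cluster assignments

# Groups may have different sizes (here 2, 2, and 1), hence "ragged".
centroid_x <- tapply(x_demo, groups, mean)
centroid_x
```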

|===================================== | 48%
| Now that we have new x and new y coordinates for the 3 centroids we can plot them.
| We've stored off the coordinates for you in variables newCx and newCy. Use the R
| command points with these as the first 2 arguments. In addition, use the arguments
| col set equal to cols1, pch equal to 8, cex equal to 2 and lwd also equal to 2.

points(newCx,newCy,col=cols1,pch=8,cex=2,lwd=2)

| Keep up the great work!

|====================================== | 50%
| We see how the centroids have moved closer to their respective clusters. This is
| especially true of the second (orange) cluster. Now call the distance function mdist
| with the 4 arguments x, y, newCx, and newCy. This will allow us to reassign the data
| points to new clusters if necessary.

mdist(x,y,newCx,newCy)

           [,1]        [,2]      [,3]      [,4]      [,5]      [,6]      [,7]     [,8]
[1,] 0.98911875 0.539152725 0.2901879 1.0286979 0.7936966 0.8004956 0.4650664 1.028698
[2,] 0.09287262 0.002053041 0.0734304 0.2313694 1.9333732 1.8320407 1.4310971 2.926095
[3,] 3.28531180 2.197487387 1.6676725 4.0113796 0.4652075 0.3721778 0.6043861 1.643033
          [,9]    [,10]     [,11]     [,12]
[1,] 3.3053706 3.282778 3.5391512 2.9345445
[2,] 3.5224442 3.295301 3.5990955 3.2097944
[3,] 0.2586908 0.309730 0.3610747 0.1602755

| Excellent work!

|======================================== | 52%
| We've stored off this new matrix of distances in the matrix distTmp2 for you. Recall
| that the first cluster is red, the second orange and the third purple. Look closely
| at columns 4 and 7 of distTmp2. What will happen to points 4 and 7?

1: They will both change to cluster 2
2: Nothing
3: They're the only points that won't change clusters
4: They will both change clusters

Selection: 4

| You nailed it! Good job!

|========================================== | 54%
| Now call apply with 3 arguments, distTmp2, 2, and which.min to find the new cluster
| assignments for the points.

apply(distTmp2,2,which.min)
[1] 2 2 2 2 3 3 1 1 3 3 3 3

| That's a job well done!

|=========================================== | 56%
| We've stored off the new cluster assignments in a vector of factors called
| newClust2. Use the R function points to recolor the points with their new
| assignments. Again, there are 5 arguments, x and y are first, followed by pch set to
| 19, cex to 2, and col to cols1[newClust2].

points(x,y,pch=19,cex=2,col=cols1[newClust2])

| Keep working like that and you'll get there!

|============================================= | 58%
| Notice that points 4 and 7 both changed clusters: point 4 moved from 1 to 2 (red to
| orange), and point 7 switched from 3 to 1 (purple to red).

...

|============================================== | 60%
| Now use tapply to find the x coordinate of the new centroid. Recall there are 3
| arguments, x, newClust2, and mean.

tapply(x,newClust2,mean)

        1         2         3
1.8878628 0.8904553 2.6001704

| You're the best!

|================================================ | 62%
| Do the same to find the new y coordinate.

tapply(y,newClust2,mean)

       1        2        3
2.157866 1.006871 1.274675

| Excellent work!

|================================================= | 64%
| We've stored off these coordinates for you in the variables finalCx and finalCy.
| Plot these new centroids using the points function with 6 arguments. The first 2 are
| finalCx and finalCy. The argument col should equal cols1, pch should equal 9, cex 2
| and lwd 2.

points(finalCx,finalCy,col=cols1,pch=9,cex=2,lwd=2)

| You nailed it! Good job!

|=================================================== | 66%
| It should be obvious that if we continued this process points 5 through 8 would all
| turn red, while points 1 through 4 stay orange, and points 9 through 12 purple.

...

|==================================================== | 68%
| Now that you've gone through an example step by step, you'll be relieved to hear
| that R provides a command to do all this work for you. Unsurprisingly it's called
| kmeans and, although it has several parameters, we'll just mention four. These are
| x, (the numeric matrix of data), centers, iter.max, and nstart. The second of these
| (centers) can be either a number of clusters or a set of initial centroids. The
| third, iter.max, specifies the maximum number of iterations to go through, and
| nstart is the number of random starts you want to try if you specify centers as a
| number.

...
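
The two ways of supplying `centers` can be sketched as follows. The data frame `demo_data` and the starting matrix `init` are invented for this example; they are not the lesson's `dataFrame` or `cmat`.

```r
set.seed(123)
# Made-up data: three groups of four points each.
demo_data <- data.frame(
  x = c(rnorm(4, 1, 0.2), rnorm(4, 2, 0.2), rnorm(4, 3, 0.2)),
  y = c(rnorm(4, 1, 0.2), rnorm(4, 2, 0.2), rnorm(4, 1, 0.2))
)

# centers as a number: kmeans picks random rows as starting centroids;
# nstart = 10 tries 10 such starts and keeps the best result.
fit_random <- kmeans(demo_data, centers = 3, iter.max = 10, nstart = 10)

# centers as a matrix: one row per initial centroid, so no random start is used.
init <- rbind(c(1.2, 1.2), c(2.2, 1.8), c(2.8, 1.2))
fit_fixed <- kmeans(demo_data, centers = init, iter.max = 10)

fit_fixed$size
fit_fixed$iter
```

Supplying explicit centroids makes a run repeatable, while a number of clusters plus `nstart` guards against a poor random start.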

|====================================================== | 70%
| Call kmeans now with 2 arguments, dataFrame (which holds the x and y coordinates of
| our 12 points) and centers set equal to 3.

kmeans(dataFrame,centers=3)
 K-means clustering with 3 clusters of sizes 4, 4, 4

Cluster means:
          x         y
1 0.8904553 1.0068707
2 2.8534966 0.9831222
3 1.9906904 2.0078229

Clustering vector:
[1] 1 1 1 1 3 3 3 3 2 2 2 2

Within cluster sum of squares by cluster:
[1] 0.34188313 0.03298027 0.34732441
(between_SS / total_SS = 93.6 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

| You are really on a roll!

|======================================================= | 72%
| The program returns the information that the data clustered into 3 clusters each of
| size 4. It also returns the coordinates of the 3 cluster means, a vector named
| cluster indicating how the 12 points were partitioned into the clusters, and the sum
| of squares within each cluster. It also shows all the available components returned
| by the function. We've stored off this data for you in a kmeans object called kmObj.
| Look at kmObj$iter to see how many iterations the algorithm went through.

kmObj$iter
[1] 1

| You got it!

|========================================================= | 74%
| Two iterations as we did before. We just want to emphasize how you can access the
| information available to you. Let's plot the data points color coded according to
| their cluster. This was stored in kmObj$cluster. Run plot with 5 arguments. The
| data, x and y, are the first two; the third, col is set equal to kmObj$cluster, and
| the last two are pch and cex. The first of these should be set to 19 and the last to
| 2.

plot(x,y,col=kmObj$cluster,pch=19,cex=2)

| You are doing so well!

|=========================================================== | 76%
| Now add the centroids which are stored in kmObj$centers. Use the points function
| with 5 arguments. The first two are kmObj$centers and col=c("black","red","green").
| The last three, pch, cex, and lwd, should all equal 3.

points(kmObj$centers,col=c("black","red","green"),pch=3,cex=3,lwd=3)

| Excellent work!

|============================================================ | 78%
| Now for some fun! We want to show you how the output of the kmeans function is
| affected by its random start (when you just ask for a number of clusters). With
| random starts you might want to run the function several times to get an idea of the
| relationships between your observations. We'll call kmeans with the same data points
| (stored in dataFrame), but ask for 6 clusters instead of 3.

...

|============================================================== | 80%
| We'll plot our data points several times and each time we'll just change the
| argument col which will show us how the R function kmeans is clustering them. So,
| call plot now with 5 arguments. The first 2 are x and y. The third is col set equal
| to the call kmeans(dataFrame,6)$cluster. The last two (pch and cex) are set to 19
| and 2 respectively.

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

| You nailed it! Good job!

|=============================================================== | 82%
| See how the points cluster? Now recall your last command and rerun it.

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

| Nice work!

|================================================================= | 84%
| See how the clustering has changed? As the Teletubbies would say, "Again! Again!"

plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)

| That's the answer I was looking for.

|================================================================== | 86%
| So the clustering changes with different starts. Perhaps 6 is too many clusters?
| Let's review!

...
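
One way to see the effect of random starts numerically, rather than by eye, is to compare the total within-cluster sum of squares across runs. The data below is made up for the sketch; `single` and `multi` are names introduced here.

```r
set.seed(2020)
# Made-up data: three groups of four points each.
demo_data <- cbind(
  x = c(rnorm(4, 1, 0.2), rnorm(4, 2, 0.2), rnorm(4, 3, 0.2)),
  y = c(rnorm(4, 1, 0.2), rnorm(4, 2, 0.2), rnorm(4, 1, 0.2))
)

# Ten separate runs, each with a single random start and 6 clusters.
# The objective value may differ from run to run.
single <- replicate(10, kmeans(demo_data, centers = 6, nstart = 1)$tot.withinss)

# One call that tries 25 random starts and keeps the best of them.
multi <- kmeans(demo_data, centers = 6, nstart = 25)$tot.withinss

range(single)
multi
```

A wide range in `single` suggests the solution is sensitive to the start, which is one hint that the requested number of clusters may not suit the data.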

|==================================================================== | 88%
| True or False? K-means clustering requires you to specify a number of clusters
| before you begin.

1: False
2: True

Selection: 2

| You nailed it! Good job!

|===================================================================== | 90%
| True or False? K-means clustering requires you to specify a number of iterations
| before you begin.

1: True
2: False

Selection: 2

| You got it right!

|======================================================================= | 92%
| True or False? Every data set has a single fixed number of clusters.

1: False
2: True

Selection: 1

| Excellent job!

|======================================================================== | 94%
| True or False? K-means clustering will always stop in 3 iterations

1: True
2: False

Selection: 2

| You are really on a roll!

|========================================================================== | 96%
| True or False? When starting kmeans with random centroids, you'll always end up with
| the same final clustering.

1: False
2: True

Selection: 1

| Great job!

|=========================================================================== | 98%
| Congratulations! We hope this means you found this lesson oK.

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: No
2: Yes

Selection: 2
What is your assignment token? xXxXxxXXxXxxXXXx

| Excellent work!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-05-09 20:59:57.636628 IST

# Hierarchical Clustering

setwd("C:/images")
library(swirl)

| Hi! Type swirl() when you are ready to begin.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

 1: Principles of Analytic Graphs   2: Exploratory Graphs
 3: Graphics Devices in R           4: Plotting Systems
 5: Base Plotting System            6: Lattice Plotting System
 7: Working with Colors             8: GGPlot2 Part1
 9: GGPlot2 Part2                  10: GGPlot2 Extras
11: Hierarchical Clustering        12: K Means Clustering
13: Dimension Reduction            14: Clustering Example
15: CaseStudy

Selection: 11

| Attempting to load lesson dependencies...

| This lesson requires the ‘fields’ package. Would you like me to install it for you
| now?

1: Yes
2: No

Selection: 1

| Trying to install package ‘fields’ now...
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
also installing the dependencies ‘dotCall64’, ‘spam’, ‘maps’

package ‘dotCall64’ successfully unpacked and MD5 sums checked
package ‘spam’ successfully unpacked and MD5 sums checked
package ‘maps’ successfully unpacked and MD5 sums checked
package ‘fields’ successfully unpacked and MD5 sums checked

| | 0%

| Hierarchical_Clustering. (Slides for this and other Data Science courses may be
| found at github https://github.com/DataScienceSpecialization/courses/. If you care
| to use them, they must be downloaded as a zip file and viewed locally. This lesson
| corresponds to 04_ExploratoryAnalysis/hierarchicalClustering.)

...

|= | 2%
| In this lesson we'll learn about hierarchical clustering, a simple way of quickly
| examining and displaying multi-dimensional data. This technique is usually most
| useful in the early stages of analysis when you're trying to get an understanding of
| the data, e.g., finding some pattern or relationship between different factors or
| variables. As the name suggests hierarchical clustering creates a hierarchy of
| clusters.

...

|== | 3%
| Clustering organizes data points that are close into groups. So obvious questions
| are "How do we define close?", "How do we group things?", and "How do we interpret
| the grouping?" Cluster analysis is a very important topic in data analysis.

...

|==== | 5%
| To give you an idea of what we're talking about, consider these random points we
| generated. We'll use them to demonstrate hierarchical clustering in this lesson.
| We'll do this in several steps, but first we have to clarify our terms and concepts.

...

|===== | 6%
| Hierarchical clustering is an agglomerative, or bottom-up, approach. From Wikipedia
| (http://en.wikipedia.org/wiki/Hierarchical_clustering), we learn that in this
| method, "each observation starts in its own cluster, and pairs of clusters are
| merged as one moves up the hierarchy." This means that we'll find the closest two
| points and put them together in one cluster, then find the next closest pair in the
| updated picture, and so forth. We'll repeat this process until we reach a reasonable
| stopping place.

...
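
The first agglomerative step (find the closest pair and merge it) can be sketched with base R. The six random points are made up, and `hclust` is called with its default settings purely to confirm that its first merge is the same pair.

```r
set.seed(3)
pts <- data.frame(x = runif(6), y = runif(6))   # made-up points

d <- dist(pts)            # pairwise Euclidean distances
dm <- as.matrix(d)
diag(dm) <- Inf           # ignore each point's zero distance to itself

# Row and column index of the smallest remaining distance: the closest pair.
closest <- which(dm == min(dm), arr.ind = TRUE)[1, ]
closest

# hclust carries out the full bottom-up merging; row 1 of $merge is the
# first pair joined (single observations appear as negative indices).
hc <- hclust(d)
hc$merge[1, ]
```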

|====== | 8%
| Note the word "reasonable". There's a lot of flexibility in this field and how you
| perform your analysis depends on your problem. Again, Wikipedia tells us, "one can
| decide to stop clustering either when the clusters are too far apart to be merged
| (distance criterion) or when there is a sufficiently small number of clusters
| (number criterion)."

...

|======= | 10%
| First, how do we define close? This is the most important step and there are several
| possibilities depending on the questions you're trying to answer and the data you
| have. Distance or similarity are usually the metrics used.

...

|========= | 11%
| In the given plot which pair of points would you first cluster? Use distance as the
| metric.

1: 5 and 6
2: 1 and 4
3: 10 and 12
4: 7 and 8

Selection: 1

|========== | 13%
| It's pretty obvious that out of the 4 choices, the pair 5 and 6 were the closest
| together. However, there are several ways to measure distance or similarity.
| Euclidean distance and correlation similarity are continuous measures, while
| Manhattan distance is a binary measure. In this lesson we'll just briefly discuss
| the first and last of these. It's important that you use a measure of distance that

...

|=========== | 15%
| Euclidean distance is what you learned about in high school algebra. Given two
| points on a plane, (x1,y1) and (x2,y2), the Euclidean distance is the square root of
| the sums of the squares of the distances between the two x-coordinates (x1-x2) and
| the two y-coordinates (y1-y2). You probably recognize this as an application of the
| Pythagorean theorem which yields the length of the hypotenuse of a right triangle.

...

|============ | 16%
| It shouldn't be hard to believe that this generalizes to more than two dimensions as
| shown in the formula at the bottom of the picture shown here.

...
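
As a quick worked example of the formula, the sketch below computes Euclidean distance by hand in two and three dimensions and checks the result against `dist`. The points are arbitrary values chosen so the answers are whole numbers.

```r
# Two dimensions: differences of 3 and 4 give a hypotenuse of 5.
p1 <- c(1, 2)
p2 <- c(4, 6)
euclid_2d <- sqrt(sum((p1 - p2)^2))

# The same expression works unchanged in three (or more) dimensions.
q1 <- c(1, 2, 2)
q2 <- c(3, 4, 3)
euclid_3d <- sqrt(sum((q1 - q2)^2))

# dist() uses Euclidean distance by default and should agree.
c(euclid_2d, as.numeric(dist(rbind(p1, p2))))
c(euclid_3d, as.numeric(dist(rbind(q1, q2))))
```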

|============== | 18%
| Euclidean distance is distance "as the crow flies". Many applications, however,
| can't realistically use crow-flying distance. Cars, for instance, have to follow

...

|=============== | 19%
| In this case, we can use Manhattan or city block distance (also known as a taxicab
| metric). This picture, copied from http://en.wikipedia.org/wiki/Taxicab_geometry,
| shows what this means.

...

|================ | 21%
| You want to travel from the point at the lower left to the one on the top right. The
| shortest distance is the Euclidean (the green line), but you're limited to the grid,
| so you have to follow a path similar to those shown in red, blue, or yellow. These
| all have the same length (12) which is the number of small gray segments covered by
| their paths.

...

|================= | 23%
| More formally, Manhattan distance is the sum of the absolute values of the distances
| between each coordinate, so the distance between the points (x1,y1) and (x2,y2) is
| |x1-x2|+|y1-y2|. As with Euclidean distance, this too generalizes to more than 2
| dimensions.
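
A small sketch of the Manhattan formula, assuming two made-up corner points six blocks apart in each direction (chosen so the grid path has length 12, like the picture). The helper name manhattan is hypothetical.

```r
# Manhattan (city block) distance: sum of absolute coordinate differences
# (manhattan is a hypothetical helper name)
manhattan <- function(p, q) sum(abs(p - q))

p <- c(0, 0)   # assumed lower-left corner
q <- c(6, 6)   # assumed upper-right corner

manhattan(p, q)          # 12 grid segments
sqrt(sum((p - q)^2))     # Euclidean distance, "as the crow flies"

# dist() offers the same metric through its method argument
as.numeric(dist(rbind(p, q), method = "manhattan"))
```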

...

|=================== | 24%
| Now we'll go back to our random points. You might have noticed that these points
| don't really look randomly positioned, and in fact, they're not. They were actually
| generated as 3 distinct clusters. We've put the coordinates of these points in a
| data frame for you, called dataFrame.

...

|==================== | 26%
| We'll use this dataFrame to demonstrate an agglomerative (bottom-up) technique of
| hierarchical clustering and create a dendrogram. This is an abstract picture (or
| graph) which shows how the 12 points in our dataset cluster together. Two clusters
| (initially, these are points) that are close are connected with a line. We'll use
| Euclidean distance as our metric of closeness.

...

|===================== | 27%
| Run the R command dist with the argument dataFrame to compute the distances between
| all pairs of these points. By default dist uses Euclidean distance as its metric,
| but other metrics such as Manhattan, are available. Just use the default.

dist(dataFrame)

            1          2          3          4          5          6          7
2  0.34120511
3  0.57493739 0.24102750
4  0.26381786 0.52578819 0.71861759
5  1.69424700 1.35818182 1.11952883 1.80666768
6  1.65812902 1.31960442 1.08338841 1.78081321 0.08150268
7  1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
8  1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931 0.65062566
9  2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116 1.28582631
10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201 1.32063059
11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373 1.37369662
12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231 1.17740375
            8          9         10         11
2
3
4
5
6
7
8
9  1.76460709
10 1.83517785 0.14090406
11 1.86999431 0.11624471 0.08317570
12 1.66223814 0.10848966 0.19128645 0.20802789

| Great job!

|====================== | 29%
| You see that the output is a lower triangular matrix with rows numbered from 2 to 12
| and columns numbered from 1 to 11. Entry (i,j) indicates the distance between points
| i and j. Clearly you need only a lower triangular matrix since the distance between
| points i and j equals that between j and i.
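
To see the symmetry for yourself, a dist object can be expanded into a full square matrix. The four points below are invented for this sketch and stand in for the lesson's dataFrame.

```r
# Four made-up points (not the lesson's dataFrame)
pts <- data.frame(x = c(0, 1, 1, 3), y = c(0, 0, 0.2, 3))

d <- dist(pts)     # prints as a lower triangle
d
length(d)          # n * (n - 1) / 2 pairwise distances

# Expanding to a full matrix shows the mirror image and the zero diagonal
m <- as.matrix(d)
m
isSymmetric(m)
```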

...

|======================== | 31%
| From the output of dist, what is the minimum distance between two points?

1: 0.08317
2: -0.0700
3: 0.1085
4: 0.0815

Selection: 4

| Excellent job!

|========================= | 32%
| So 0.0815 (units are unspecified) between points 5 and 6 is the shortest distance.
| We can put these points in a single cluster and look for another close pair of
| points.
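
Scanning a distance table by eye gets tedious, so here is one possible way to locate the closest pair programmatically. It uses four invented points rather than the lesson's data.

```r
# Four made-up points; points 2 and 3 are deliberately close together
pts <- data.frame(x = c(0, 1, 1, 3), y = c(0, 0, 0.2, 3))
m <- as.matrix(dist(pts))

# Blank out the diagonal and upper triangle so each pair is considered once
m[upper.tri(m, diag = TRUE)] <- NA

# Row and column index of the smallest remaining distance
closest <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)
closest
min(m, na.rm = TRUE)
```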

...

|========================== | 34%
| Looking at the picture, what would be another good pair of points to put in another
| cluster given that 5 and 6 are already clustered?

1: 7 and the cluster containing 5 and 6
2: 10 and 11
3: 7 and 8
4: 1 and 4

Selection: 2

| You are amazing!

|=========================== | 35%
| So 10 and 11 are another pair of points that would be in a second cluster. We'll
| start creating our dendrogram now. Here're the original plot and two beginning
| pieces of the dendrogram.

...

|============================= | 37%
| We can keep going like this in the obvious way and pair up individual points, but as
| luck would have it, R provides a simple function which you can call which creates a
| dendrogram for you. It's called hclust() and takes as an argument the pairwise
| distance matrix which we looked at before. We've stored this matrix for you in a
| variable called distxy. Run hclust now with distxy as its argument and put the
| result in the variable hc.

hc<-hclust(distxy)

|============================== | 39%
| You're probably curious and want to see hc.

...

|=============================== | 40%
| Call the R function plot with one argument, hc.

plot(hc)

| That's correct!

|================================ | 42%
| Nice plot, right? R's plot conveniently labeled everything for you. The points we
| saw are the leaves at the bottom of the graph, 5 and 6 are connected, as are 10 and
| 11. Moreover, we see that the original 3 groupings of points are closest together as
| leaves on the picture. That's reassuring. Now call plot again, this time with the
| argument as.dendrogram(hc).

plot(as.dendrogram(hc))

| Keep up the great work!
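
The object returned by hclust also records the merges numerically, which can help when reading the plot. The sketch below uses four invented points; the components merge, height, and order are standard parts of an hclust object.

```r
# Four made-up points; points 2 and 3 are the closest pair
pts <- data.frame(x = c(0, 1, 1, 3), y = c(0, 0, 0.2, 3))
hc <- hclust(dist(pts))   # complete linkage is the default

hc$merge    # one row per merge; negative entries are original points
hc$height   # the distance at which each merge happened
hc$order    # left-to-right order of the leaves in the plot
```

The first row of merge should name the closest pair, and the heights correspond to the vertical scale of the dendrogram.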

|================================== | 44%
| The essentials are the same, but the labels are missing and the leaves (original
| points) are all printed at the same level. Notice that the vertical heights of the
| lines and labeling of the scale on the left edge give some indication of distance.
| Use the R command abline to draw a horizontal blue line at 1.5 on this plot. Recall
| that this requires 2 arguments, h=1.5 and col="blue".

abline(h=1.5,col="blue")

| Keep working like that and you'll get there!

|=================================== | 45%
| We see that this blue line intersects 3 vertical lines and this tells us that using
| the distance 1.5 (unspecified units) gives us 3 clusters (1 through 4), (9 through
| 12), and (5 through 8). We call this a "cut" of our dendrogram. Now cut the
| dendrogram by drawing a red horizontal line at .4.

abline(h=0.4,col="red")

| You are really on a roll!

|==================================== | 47%
| How many clusters are there with a cut at this distance?

5
[1] 5

| Keep up the great work!

|===================================== | 48%
| We see that by cutting at .4 we have 5 clusters, indicating that this distance is
| small enough to break up our original grouping of points. If we drew a horizontal
| line at .05, how many clusters would we get?

5
[1] 5

| That's not exactly what I'm looking for. Try again. Or, type info() for more
| options.

| Recall that our shortest distance was around .08, so a distance smaller than that
| would make all the points their own private clusters.

12
[1] 12

|====================================== | 50%
| Try it now (draw a horizontal line at .05) and make the line green.

abline(h=0.05,col="green")

|======================================== | 52%
| So the number of clusters in your data depends on where you draw the line! (We said
| there's a lot of flexibility here.) Now that we've seen the practice, let's go back
| to some "theory". Notice that the two original groupings, 5 through 8, and 9 through
| 12, are connected with a horizontal line near the top of the display. You're
| probably wondering how distances between clusters of points are measured.
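
Counting intersections on the plot can also be done in code with cutree, which returns the cluster membership for a cut at a given height. This is a sketch on twelve invented points in three well-separated groups, not the lesson's data, so the cut heights differ from those used above.

```r
# Twelve made-up points in three well-separated groups
pts <- data.frame(
  x = c(0.0, 0.3, 0.1, 0.4,  5.0, 5.2, 5.1, 5.5,  0.0, 0.2, 0.1, 0.3),
  y = c(0.0, 0.1, 0.4, 0.3,  0.0, 0.3, 0.1, 0.4,  5.0, 5.3, 5.1, 5.4)
)
hc <- hclust(dist(pts))

# Number of clusters produced by a horizontal cut at height h
# (n_clusters is a hypothetical helper)
n_clusters <- function(h) length(unique(cutree(hc, h = h)))

n_clusters(2)      # cut between the groups
n_clusters(0.3)    # cut inside the groups
n_clusters(0.05)   # cut below the smallest pairwise distance
```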

...

|========================================= | 53%
| There are several ways to do this. We'll just mention two. The first is called
| complete linkage and it says that if you're trying to measure a distance between two
| clusters, take the greatest distance between the pairs of points in those two
| clusters. Obviously such pairs contain one point from each cluster.
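
Complete linkage is easy to compute by hand from a distance matrix. The sketch below, on twelve invented points with assumed group memberships, takes the largest cross-cluster distance for each pair of groups and compares the result with the last merge heights reported by hclust.

```r
# Twelve made-up points; rows 1-4, 5-8 and 9-12 are assumed to be the groups
pts <- data.frame(
  x = c(0.0, 0.3, 0.1, 0.4,  5.0, 5.2, 5.1, 5.5,  0.0, 0.2, 0.1, 0.3),
  y = c(0.0, 0.1, 0.4, 0.3,  0.0, 0.3, 0.1, 0.4,  5.0, 5.3, 5.1, 5.4)
)
m <- as.matrix(dist(pts))
a <- 1:4; b <- 5:8; cc <- 9:12

# Complete linkage: greatest distance over pairs with one point in each cluster
# (complete_link is a hypothetical helper)
complete_link <- function(i, j) max(m[i, j])

links <- c(ab = complete_link(a, b),
           ac = complete_link(a, cc),
           bc = complete_link(b, cc))
links

# The last two merges in hclust join the three groups
tail(hclust(dist(pts))$height, 2)
```

The smaller of the two final heights should equal the smallest of the three linkage distances, since the closest pair of groups is joined first.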

...

|========================================== | 55%
| So if we were measuring the distance between the two clusters of points (1 through
| 4) and (5 through 8), using complete linkage as the metric we would use the distance
| between points 4 and 8 as the measure since this is the largest distance between the
| pairs of those groups.

...

|=========================================== | 56%
| The distance between the two clusters of points (9 through 12) and (5 through 8),
| using complete linkage as the metric, is the distance between points 11 and 8 since
| this is the largest distance between the pairs of those groups.

...

|============================================= | 58%
| As luck would have it, the distance between the two clusters of points (9 through
| 12) and (1 through 4), using complete linkage as the metric, is the distance between
| points 11 and 4.

...

|============================================== | 60%
| We've created the dataframe dFsm for you containing these 3 points, 4, 8, and 11.
| Run dist on dFsm to see what the smallest distance between these 3 points is.

dist(dFsm)
         1        2
2 2.028495
3 2.374620 1.869994

| Keep up the great work!

|=============================================== | 61%
| We see that the smallest distance is between points 2 and 3 in this reduced set,
| (these are actually points 8 and 11 in the original set), indicating that the two
| clusters these points represent ((5 through 8) and (9 through 12) respectively)
| would be joined (at a distance of 1.869) before being connected with the third
| cluster (1 through 4). This is consistent with the dendrogram we plotted.

...

|================================================ | 63%
| The second way to measure a distance between two clusters that we'll just mention is
| called average linkage. First you compute an "average" point in each cluster (think
| of it as the cluster's center of gravity). You do this by computing the mean
| (average) x and y coordinates of the points in the cluster.

...

|================================================== | 65%
| Then you compute the distances between each cluster average to compute the
| intercluster distance.
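
The centre-of-gravity calculation described above can be sketched as follows, again on invented points with assumed group memberships. One point worth knowing when trying this with hclust: its documentation lists method = "centroid" and method = "average" separately, and "average" takes the mean of all pairwise distances between the two clusters, so the sketch prints both quantities for comparison.

```r
# Twelve made-up points; rows 1-4 and 5-8 are assumed to be two clusters
pts <- data.frame(
  x = c(0.0, 0.3, 0.1, 0.4,  5.0, 5.2, 5.1, 5.5,  0.0, 0.2, 0.1, 0.3),
  y = c(0.0, 0.1, 0.4, 0.3,  0.0, 0.3, 0.1, 0.4,  5.0, 5.3, 5.1, 5.4)
)
a <- 1:4; b <- 5:8

# "Average point" of each cluster: mean x and mean y
centre_a <- colMeans(pts[a, ])
centre_b <- colMeans(pts[b, ])

# Distance between the two cluster averages
centre_dist <- sqrt(sum((centre_a - centre_b)^2))
centre_dist

# For comparison: mean of all pairwise distances between the clusters
pair_mean <- mean(as.matrix(dist(pts))[a, b])
pair_mean
```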

...

|=================================================== | 66%
| Now look at the hierarchical cluster we created before, hc.

hc

Call:
hclust(d = distxy)

Cluster method : complete
Distance : euclidean
Number of objects: 12

| Excellent work!

|==================================================== | 68%
| Which type of linkage did hclust() use to agglomerate clusters?

1: complete
2: average

Selection: 1

| You nailed it! Good job!

|===================================================== | 69%
| In our simple set of data, the average and complete linkages aren't that different,
| but in more complicated datasets the type of linkage you use could affect how your
| data clusters. It is a good idea to experiment with different methods of linkage to
| see the varying ways your data groups. This will help you determine the best way to
| continue with your analysis.
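
One way to run that experiment is to cut the trees from several linkage methods into the same number of clusters and cross-tabulate the memberships. The points are invented and well separated, so the methods are expected to agree here; on messier data the table would show where they differ.

```r
# Twelve made-up points in three well-separated groups
pts <- data.frame(
  x = c(0.0, 0.3, 0.1, 0.4,  5.0, 5.2, 5.1, 5.5,  0.0, 0.2, 0.1, 0.3),
  y = c(0.0, 0.1, 0.4, 0.3,  0.0, 0.3, 0.1, 0.4,  5.0, 5.3, 5.1, 5.4)
)
d <- dist(pts)

# Cluster membership (k = 3) under three linkage methods
methods <- c("complete", "average", "single")
memberships <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 3))
memberships

# Cross-tabulate two of the partitions; off-diagonal counts mean disagreement
table(complete = memberships[, "complete"], average = memberships[, "average"])
```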

...

|======================================================= | 71%
| The last method of visualizing data we'll mention in this lesson concerns heat maps.
| Wikipedia (http://en.wikipedia.org/wiki/Heat_map) tells us a heat map is "a
| graphical representation of data where the individual values contained in a matrix
| are represented as colors. ... Heat maps originated in 2D displays of the values in
| a data matrix. Larger values were represented by small dark gray or black squares
| (pixels) and smaller values by lighter squares."

...

|======================================================== | 73%
| You've probably seen many examples of heat maps, for instance weather radar and
| displays of ocean salinity. From Wikipedia (http://en.wikipedia.org/wiki/Heat_map)
| we learn that heat maps are often used in molecular biology "to represent the level
| of expression of many genes across a number of comparable samples (e.g. cells in
| different states, samples from different patients) as they are obtained from DNA
| microarrays."

...

|========================================================= | 74%
| We won't say too much on this topic, but a very nice concise tutorial on creating
| heatmaps in R exists at
| http://sebastianraschka.com/Articles/heatmaps_in_r.html#clustering. Here's an image
| from the tutorial to start you thinking about the topic. It shows a sample heat map
| with a dendrogram on the left edge mapping the relationship between the rows. The
| legend at the top shows how colors relate to values.

...

|========================================================== | 76%
| R provides a handy function to produce heat maps. It's called heatmap. We've put the
| point data we've been using throughout this lesson in a matrix. Call heatmap now
| with 2 arguments. The first is dataMatrix and the second is col set equal to
| cm.colors(25). This last is optional, but we like the colors better than the default
| ones.

heatmap(dataMatrix,col=cm.colors(25))

| That's a job well done!

|============================================================ | 77%
| We see an interesting display of sorts. This is a very simple heat map - simple
| because the data isn't very complex. The rows and columns are grouped together as
| shown by colors. The top rows (labeled 5, 6, and 7) seem to be in the same group
| (same colors) while 8 is next to them but colored differently. This matches the
| dendrogram shown on the left edge. Similarly, 9, 12, 11, and 10 are grouped together
| (row-wise) along with 3 and 2. These are followed by 1 and 4 which are in a separate
| group. Column data is treated independently of rows but is also grouped.
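
The row order seen in a heat map comes from the row dendrogram, and heatmap returns that ordering invisibly. Here is a minimal sketch on a small invented matrix (not the lesson's dataMatrix).

```r
# Six made-up points in a two-column matrix
m <- cbind(x = c(0.1, 0.3, 5.0, 5.2, 0.2, 0.4),
           y = c(0.0, 0.2, 0.1, 0.3, 5.1, 5.3))
rownames(m) <- 1:6

# Draw the heat map and keep the invisible return value
hm <- heatmap(m, col = cm.colors(25))

hm$rowInd   # row order chosen by the row dendrogram
hm$colInd   # column order chosen by the column dendrogram
```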

...

|============================================================= | 79%
| We've subsetted some vehicle data from mtcars, the Motor Trend Car Road Tests which
| is part of the package datasets. The data is in the matrix mt and contains 6 factors
| of 11 cars. Run heatmap now with mt as its only argument.

heatmap(mt)

| Keep working like that and you'll get there!

|============================================================== | 81%
| This looks slightly more interesting than the heatmap for the point data. It shows a
| little better how the rows and columns are treated (clustered and colored)
| independently of one another. To understand the disparity in color (between the left
| 4 columns and the right 2) look at mt now.

mt

                  mpg cyl  disp  hp drat    wt
Dodge Challenger 15.5   8 318.0 150 2.76 3.520
AMC Javelin      15.2   8 304.0 150 3.15 3.435
Camaro Z28       13.3   8 350.0 245 3.73 3.840
Pontiac Firebird 19.2   8 400.0 175 3.08 3.845
Fiat X1-9        27.3   4  79.0  66 4.08 1.935
Porsche 914-2    26.0   4 120.3  91 4.43 2.140
Lotus Europa     30.4   4  95.1 113 3.77 1.513
Ford Pantera L   15.8   8 351.0 264 4.22 3.170
Ferrari Dino     19.7   6 145.0 175 3.62 2.770
Maserati Bora    15.0   8 301.0 335 3.54 3.570
Volvo 142E       21.4   4 121.0 109 4.11 2.780

| You're the best!

|=============================================================== | 82%
| See how four of the columns are all relatively small numbers and only two (disp and
| hp) are large? That explains the big difference in color columns. Now to understand
| the grouping of the rows call plot with one argument, the dendrogram object denmt
| we've created for you.

plot(denmt)

| Excellent work!
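
The effect of column magnitude on the colors can be explored with heatmap's scale argument. The sketch assumes mt is simply the eleven listed cars and first six columns of mtcars; that construction is a guess based on the printed table, not something the lesson states.

```r
# Assumed reconstruction of mt from the built-in mtcars data
cars <- c("Dodge Challenger", "AMC Javelin", "Camaro Z28", "Pontiac Firebird",
          "Fiat X1-9", "Porsche 914-2", "Lotus Europa", "Ford Pantera L",
          "Ferrari Dino", "Maserati Bora", "Volvo 142E")
mt <- as.matrix(mtcars[cars, c("mpg", "cyl", "disp", "hp", "drat", "wt")])

# disp and hp are on a much larger scale than the other columns
round(colMeans(mt), 1)

heatmap(mt)                     # default scale = "row"
heatmap(mt, scale = "column")   # standardise each column before coloring
```

With scale = "column", each variable is centered and scaled on its own, so no single column dominates the color range.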

|================================================================= | 84%
| We see that this dendrogram is the one displayed at the side of the heat map. How
| was this created? Recall that we generalized the distance formula for more than 2
| dimensions. We've created a distance matrix for you, distmt. Look at it now.

distmt

                 Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
AMC Javelin              14.00890
Camaro Z28              100.27404   105.57041
Pontiac Firebird         85.80733    99.28330   86.22779
Fiat X1-9               253.64640   240.51305  325.11191        339.12867
Porsche 914-2           206.63309   193.29419  276.87318        292.15588  48.29642
Lotus Europa            226.48724   212.74240  287.59666        311.37656  49.78046
Ford Pantera L          118.69012   123.31494   19.20778        101.66275 336.65679
Ferrari Dino            174.86264   161.03078  216.72821        255.01117 127.67016
Maserati Bora           185.78176   185.02489  102.48902        188.19917 349.02042
Volvo 142E              201.35337   187.68535  266.49555        286.74036  60.40302
                 Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
AMC Javelin
Camaro Z28
Pontiac Firebird
Fiat X1-9
Porsche 914-2
Lotus Europa          33.75246
Ford Pantera L       288.56998    297.51961
Ferrari Dino          87.81135     80.33743      224.44761
Maserati Bora        303.85577    303.20992       86.84620    223.52346
Volvo 142E            18.60543     27.74042      277.43923     70.27895     289.02233

| You nailed it! Good job!

|================================================================== | 85%
| See how these distances match those in the dendrogram? So hclust really works!
| Let's review now.
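
The chain from data to dendrogram can be reproduced in a few lines. This sketch assumes mt is the listed subset of mtcars and that denmt was built with hclust's defaults; neither detail is stated in the lesson, so treat the names below as stand-ins.

```r
# Assumed reconstruction of mt from the built-in mtcars data
cars <- c("Dodge Challenger", "AMC Javelin", "Camaro Z28", "Pontiac Firebird",
          "Fiat X1-9", "Porsche 914-2", "Lotus Europa", "Ford Pantera L",
          "Ferrari Dino", "Maserati Bora", "Volvo 142E")
mt <- as.matrix(mtcars[cars, c("mpg", "cyl", "disp", "hp", "drat", "wt")])

# Euclidean distance across all six variables
distmt <- dist(mt)
round(as.matrix(distmt)["AMC Javelin", "Dodge Challenger"], 5)   # 14.0089

# Cluster and convert to a dendrogram (assumed recipe for denmt)
hcmt <- hclust(distmt)
denmt <- as.dendrogram(hcmt)
plot(denmt)

# The lowest join in the tree sits at the smallest pairwise distance
hcmt$height[1]
```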

...

|=================================================================== | 87%
| What is the purpose of hierarchical clustering?

1: Inspire other researchers
2: Give an idea of the relationships between variables or observations
3: None of the others
4: Present a finished picture

Selection:
Enter an item from the menu, or 0 to exit
Selection: 2

| You are amazing!

|==================================================================== | 89%
| True or False? When you're doing hierarchical clustering there are strict rules that
| you MUST follow.

1: False
2: True

Selection: 1

| You got it!

|====================================================================== | 90%
| True or False? There's only one way to measure distance.

1: False
2: True

Selection: 1

| Excellent work!

|======================================================================= | 92%
| True or False? Complete linkage is a method of computing distances between clusters.

1: True
2: False

Selection: 1

| That's a job well done!

|======================================================================== | 94%
| True or False? Average linkage uses the maximum distance between points of two
| clusters as the distance between those clusters.

1: True
2: False

Selection: 2

| You nailed it! Good job!

|========================================================================= | 95%
| True or False? The number of clusters you derive from your data depends on the
| distance at which you choose to cut it.

1: True
2: False

Selection: 1

|=========================================================================== | 97%
| True or False? Once you decide basics, such as defining a distance metric and
| linkage method, hierarchical clustering is deterministic.

1: True
2: False

Selection: 1

| You nailed it! Good job!

|============================================================================ | 98%
| Congratulations! We hope this lesson didn't fluster you or get you too heated!

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: No
2: Yes

Selection: 2
What is your assignment token? xXxXxxXXxXxxXXXx

| That's the answer I was looking for.

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-05-09 20:05:13.711728 IST

# GGPlot2 Extras

setwd("C:/images")
library(swirl)

| Hi! Type swirl() when you are ready to begin.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

 1: Principles of Analytic Graphs   2: Exploratory Graphs
3: Graphics Devices in R           4: Plotting Systems
5: Base Plotting System            6: Lattice Plotting System
7: Working with Colors             8: GGPlot2 Part1
9: GGPlot2 Part2                  10: GGPlot2 Extras
11: Hierarchical Clustering        12: K Means Clustering
13: Dimension Reduction            14: Clustering Example
15: CaseStudy

Selection: 10

| Attempting to load lesson dependencies...

| | 0%

| GGPlot2_Extras. (Slides for this and other Data Science courses may be found at
| github https://github.com/DataScienceSpecialization/courses/. If you care to use
| them, they must be downloaded as a zip file and viewed locally. This lesson
| corresponds to 04_ExploratoryAnalysis/ggplot2.)

...

|= | 2%
| In this lesson we'll go through a few more qplot examples using diamond data which
| comes with the ggplot2 package. This data is a little more complicated than the mpg
| data and it contains information on various characteristics of diamonds.

...

|=== | 4%
| Run the R command str with the argument diamonds to see what the data looks like.

str(diamonds)

tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

| Keep working like that and you'll get there!

|==== | 6%
| From the output, how many characteristics of diamonds do you think this data
| contains?

1: 53950
2: 53940
3: 10
4: 5394

Selection: 3

| You are doing so well!

|====== | 7%
| From the output of str, how many diamonds are characterized in this dataset?

1: 53950
2: 10
3: 5394
4: 53940

Selection: 1

| You're close...I can feel it! Try it again.

| The output says there are 53940 observations of 10 variables. This is followed by a
| 10-long list of characteristics (carat, cut, color, etc.) that can apply to
| diamonds.

1: 5394
2: 53940
3: 10
4: 53950

Selection: 2

| You got it!

|======= | 9%
| Now let's plot a histogram of the price of the 53940 diamonds in this dataset.
| Recall that a histogram requires only one variable of the data, so run the R command
| qplot with the first argument price and the argument data set equal to diamonds.
| This will show the frequency of different diamond prices.

qplot(price,data=diamonds)
stat_bin() using bins = 30. Pick better value with binwidth.

| Excellent work!

|========= | 11%
| Not only do you get a histogram, but you also get a message about the binwidth
| defaulting to range/30. Recall that range refers to the spread or dispersion of the
| data, in this case price of diamonds. Run the R command range now with
| diamonds$price as its argument.

range(diamonds$price)
[1] 326 18823

| You nailed it! Good job!

|========== | 13%
| We see that range returned the minimum and maximum prices, so the diamonds vary in
| price from $326 to $18823. We've done the arithmetic for you, the range (difference
| between these two numbers) is $18497.

...

|=========== | 15%
| Rerun qplot now with 3 arguments. The first is price, the second is data set equal
| to diamonds, and the third is binwidth set equal to 18497/30). (Use the up arrow to
| save yourself some typing.) See if the plot looks familiar.

qplot(price,data=diamonds,binwidth=18497/30)

| All that practice is paying off!

|============= | 17%
| No more messages in red, but a histogram almost identical to the previous one! If
| you typed 18497/30 at the command line you would get the result 616.5667. This means
| that the height of each bin tells you how many diamonds have a price between x and
| x+617 where x is the left edge of the bin.

...

|============== | 19%
| We've created a vector containing integers that are multiples of 617 for you. It's
| called brk. Look at it now.

brk

 [1]     0   617  1234  1851  2468  3085  3702  4319  4936  5553  6170  6787  7404
[14]  8021  8638  9255  9872 10489 11106 11723 12340 12957 13574 14191 14808 15425
[27] 16042 16659 17276 17893 18510 19127

| You are amazing!

|================ | 20%
| We've also created a vector containing the number of diamonds with prices between
| each pair of adjacent entries of brk. For instance, the first count is the number of
| diamonds with prices between 0 and $617, and the second is the number of diamonds
| with prices between $617 and $1234. Look at the vector named counts now.

counts

 [1]  4611 13255  5230  4262  3362  2567  2831  2841  2203  1666  1445  1112   987
[14]   766   796   655   606   553   540   427   429   376   348   338   298   305
[27]   269   287   227   251    97

| You nailed it! Good job!

|================= | 22%
| See how it matches the histogram you just plotted? So, qplot really works!
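
The bin arithmetic can be retraced outside the lesson. This sketch rebuilds a brk-like vector with seq and counts prices per bin using base R's hist; how the lesson actually computed its counts vector is not shown, so values at the bin edges may differ slightly from those printed above.

```r
library(ggplot2)   # provides the diamonds data

rng <- range(diamonds$price)
rng                # minimum and maximum price
diff(rng)          # spread between them
diff(rng) / 30     # bin width implied by 30 bins

# Breaks at multiples of 617, mirroring the lesson's brk vector
brk <- seq(0, by = 617, length.out = 32)

# Number of prices between each pair of adjacent breaks
counts <- hist(diamonds$price, breaks = brk, plot = FALSE)$counts
counts

# Every diamond falls into exactly one bin
sum(counts) == nrow(diamonds)
```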

...

|=================== | 24%
| You're probably sick of it but rerun qplot again, this time with 4 arguments. The
| first 3 are the same as the last qplot command you just ran (price, data set equal
| to diamonds, and binwidth set equal to 18497/30). (Use the up arrow to save yourself
| some typing.) The fourth argument is fill set equal to cut. The shape of the
| histogram will be familiar, but it will be more colorful.

qplot(price,data=diamonds,binwidth=18497/30,fill=cut)

| You're the best!

|==================== | 26%
| This shows how the counts within each price grouping (bin) are distributed among the
| different cuts of diamonds. Notice how qplot displays these distributions relative
| to the cut legend on the right. The fair cut diamonds are at the bottom of each bin,
| the good cuts are above them, then the very good above them, until the ideal cuts
| are at the top of each bin. You can quickly see from this display that there are
| very few fair cut diamonds priced above $5000.

...

|===================== | 28%
| Now we'll replot the histogram as a density function which will show the proportion
| of diamonds in each bin. This means that the shape will be similar but the scale on
| the y-axis will be different since, by definition, the density function is
| nonnegative everywhere, and the area under the curve is one. To do this, simply call
| qplot with 3 arguments. The first 2 are price and data (set equal to diamonds). The
| third is geom which should be set equal to the string "density". Try this now.

qplot(price,data=diamonds,geom="density")

| Your dedication is inspiring!

|======================= | 30%
| Notice that the shape is similar to that of the histogram we saw previously. The
| highest peak is close to 0 on the x-axis meaning that most of the diamonds in the
| dataset were inexpensive. In general, as prices increase (move right along the
| x-axis) the number of diamonds (at those prices) decrease. The exception to this is
| when the price is around $4000; there's a slight increase in frequency. Let's see if
| cut is responsible for this increase.

...

|======================== | 31%
| Rerun qplot, this time with 4 arguments. The first 2 are the usual, and the third is
| geom set equal to "density". The fourth is color set equal to cut. Try this now.

qplot(price,data=diamonds,geom="density",color=cut)

| Keep working like that and you'll get there!

|========================== | 33%
| See how easily qplot did this? Four of the five cuts have 2 peaks, one at price
| $1000 and the other between $4000 and $5000. The exception is the Fair cut which has
| a single peak at $2500. This gives us a little more understanding of the histogram
| we saw before.
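
The peaks can also be estimated numerically. The sketch below draws the per-cut densities with ggplot (intended as an equivalent of the qplot call, which newer ggplot2 releases deprecate) and then uses base R's density to find where each cut's curve is highest. Exact peak locations depend on the bandwidth, so none are asserted here.

```r
library(ggplot2)

# One density curve per cut
p <- ggplot(diamonds, aes(price, color = cut)) + geom_density()
p

# Price at which each cut's estimated density is highest
peak <- sapply(split(diamonds$price, diamonds$cut), function(v) {
  dens <- density(v)
  dens$x[which.max(dens$y)]
})
round(peak)
```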

...

|=========================== | 35%
| Let's move on to scatterplots. For these we'll need to specify two variables from
| the diamond dataset.

...

|============================= | 37%
| Let's start with carat and price. Use these as the first 2 arguments of qplot. The
| third should be data set equal to the dataset. Try this now.

qplot(carat,price,data=diamonds)

| You got it right!

|============================== | 39%
| We see the positive trend here, as the number of carats increases the price also
| goes up.

...

|=============================== | 41%
| Now rerun the same command, except add a fourth parameter, shape, set equal to cut.

qplot(carat,price,data=diamonds,shape=cut)
Warning message:
Using shapes for an ordinal variable is not advised

| You are doing so well!

|================================= | 43%
| The same scatterplot appears, except the cuts of the diamonds are distinguished by
| different symbols. The legend at the right tells you which symbol is associated with
| each cut. These are small and hard to read, so rerun the same command, except this
| time instead of setting the argument shape equal to cut, set the argument color
| equal to cut.

qplot(carat,price,data=diamonds,color=cut)

| Excellent job!

|================================== | 44%
| That's easier to see! Now we'll close with two, more complicated scatterplot
| examples.

...

|==================================== | 46%
| We'll rerun the plot you just did (carat,price,data=diamonds and color=cut) but add
| an additional parameter. Use geom_smooth with the method set equal to the string
| "lm".

qplot(carat,price,data=diamonds,color=cut)+geom_smooth(method="lm")
geom_smooth() using formula 'y ~ x'

| That's a job well done!

|===================================== | 48%
| Again, we see the same scatterplot, but slightly more compressed and showing 5
| regression lines, one for each cut of diamonds. It might be hard to see, but around
| each line is a shadow showing the 95% confidence interval. We see, unsurprisingly,
| that the better the cut, the steeper (more positive) the slope of the lines.
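
The slopes of those regression lines can be read off directly by fitting one linear model per cut. This is a sketch using base R's lm; geom_smooth(method = "lm") fits the same kind of model within each color group.

```r
library(ggplot2)   # provides the diamonds data

# Fit price ~ carat separately for each cut and keep the slope
slopes <- sapply(split(diamonds, diamonds$cut), function(d) {
  coef(lm(price ~ carat, data = d))[["carat"]]
})

# Dollars of price per additional carat, by cut
round(slopes)
```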

...

|====================================== | 50%
| Finally, let's rerun that plot you just did qplot(carat,price,data=diamonds,
| color=cut) + geom_smooth(method="lm") but add one (just one) more argument to qplot.
| The new argument is facets and it should be set equal to the formula .~cut. Recall
| that the facets argument indicates we want a multi-panel plot. The symbol to the
| left of the tilde indicates rows (in this case just one) and the symbol to the right
| of the tilde indicates columns (in this case five, the number of cuts). Try this now.

qplot(carat,price,data=diamonds,color=cut,facets=.~cut)+geom_smooth(method="lm")
geom_smooth() using formula 'y ~ x'

| You are quite good my friend!
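
For comparison, the same faceted plot can be written with ggplot and explicit layers. This is meant as an equivalent sketch of the qplot call above, with facet_grid standing in for the facets argument.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point() +                                 # the scatterplot
  geom_smooth(method = "lm", formula = y ~ x) +  # one regression line per cut
  facet_grid(. ~ cut)                            # one row, one column per cut
p
```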

|======================================== | 52%
| Pretty good, right? Not too difficult either. Let's review what we learned!

...

|========================================= | 54%
| Which types of plot does qplot plot?

1: box and whisker plots
2: histograms
3: all of the others
4: scatterplots

Selection: 3

| You are doing so well!

|=========================================== | 56%
| Any and all of the above choices work; qplot is just that good. What does the gg in
| ggplot2 stand for?

1: good grief
2: goto graphics
3: grammar of graphics
4: good graphics

Selection: 3

| You are amazing!

|============================================ | 57%
| True or False? The geom argument takes a string for a value.

1: False
2: True

Selection: 2

| You are really on a roll!

|============================================== | 59%
| True or False? The method argument takes a string for a value.

1: True
2: False

Selection: 1

| You got it!

|=============================================== | 61%
| True or False? The binwidth argument takes a string for a value.

1: True
2: False

Selection: 2

| You nailed it! Good job!

|================================================ | 63%
| True or False? The user must specify x- and y-axis labels when using qplot.

1: False
2: True

Selection: 1

| That's correct!

|================================================== | 65%
| Now for some ggplots.

...

|=================================================== | 67%
| First create a graphical object g by assigning to it the output of a call to the
| function ggplot with 2 arguments. The first is the dataset diamonds and the second
| is a call to the function aes with 2 arguments, depth and price. Remember you won't
| see any result.

g<-ggplot(data=diamonds,aes(depth,price))

| You are quite good my friend!

|===================================================== | 69%
| Does g exist? Yes! Type summary with g as an argument to see what it holds.

summary(g)

data: carat, cut, color, clarity, depth, table, price, x, y, z [53940x10]
mapping:  x = ~depth, y = ~price
faceting: <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super:  <ggproto object: Class FacetNull, Facet, gg>

| That's a job well done!

|====================================================== | 70%
| We see that g holds the entire dataset. Now suppose we want to see a scatterplot of
| the relationship. Add to g a call to the function geom_point with 1 argument, alpha
| set equal to 1/3.

g+geom_point(alpha=1/3)

| You're the best!

|======================================================== | 72%
| That's somewhat interesting. We see that depth ranges from 43 to 79, but the densest
| distribution is around 60 to 65. Suppose we want to see if this relationship
| (between depth and price) is affected by cut or carat. We know cut is a factor with
| 5 levels (Fair, Good, Very Good, Premium, and Ideal). But carat is numeric and not a
| discrete factor. Can we do this?

...

|========================================================= | 74%
| Of course! That's why we asked. R has a handy command, cut, which allows you to
| divide your data into sets and label each entry as belonging to one of the sets, in
| effect creating a new factor. First, we'll have to decide where to cut the data.

...

|========================================================== | 76%
| Let's divide the data into 3 pockets, so 1/3 of the data falls into each. We'll use
| the R command quantile to do this. Create the variable cutpoints and assign to it
| the output of a call to the function quantile with 3 arguments. The first is the
| data to cut, namely diamonds$carat; the second is a call to the R function seq. This
| is also called with 3 arguments, (0, 1, and length set equal to 4). The third
| argument to the call to quantile is the boolean na.rm set equal to TRUE.

cutpoints<-quantile(diamonds$carat,seq(0,1,length=4),na.rm=TRUE)

| Keep working like that and you'll get there!

|============================================================ | 78%
| Look at cutpoints now to understand what it is.

cutpoints

       0% 33.33333% 66.66667%      100%
     0.20      0.50      1.00      5.01

| You got it right!

|============================================================= | 80%
| We see a 4-long vector (explaining why length was set equal to 4). We also see that
| .2 is the smallest carat size in the dataset and 5.01 is the largest. One third of
| the diamonds are between .2 and .5 carats and another third are between .5 and 1
| carat in size. The remaining third are between 1 and 5.01 carats. Now we can use the
| R command cut to label each of the 53940 diamonds in the dataset as belonging to one
| of these 3 factors. Create a new name in diamonds, diamonds$car2 by assigning it the
| output of the call to cut. This command takes 2 arguments, diamonds$carat, which is
| what we want to cut, and cutpoints, the places where we'll cut.

diamonds$car2<-cut(diamonds$carat,cutpoints)

| You are quite good my friend!

|=============================================================== | 81%
| Now we can continue with our multi-facet plot. First we have to reset g since we
| changed the dataset (diamonds) it contained (by adding a new column). Assign to g
| the output of a call to ggplot with 2 arguments. The dataset diamonds is the first,
| and a call to the function aes with 2 arguments (depth,price) is the second.

g<-ggplot(data=diamonds,aes(depth,price))

| You're the best!

|================================================================ | 83%
| Now add to g calls to 2 functions. This first is a call to geom_point with the
| argument alpha set equal to 1/3. The second is a call to the function facet_grid
| using the formula cut ~ car2 as its argument.

g+geom_point(alpha=1/3)+facet_grid(cut~car2)

| That's correct!

|================================================================== | 85%
| We see a multi-facet plot with 5 rows, each corresponding to a cut factor. Not
| surprising. What is surprising is the number of columns. We were expecting 3 and got
| 4. Why?

...

|=================================================================== | 87%
| The first 3 columns are labeled with the cutpoint boundaries. The fourth is labeled
| NA and shows us where the data points with missing data (NA or Not Available)
| occurred. We see that there were only a handful (12 in fact) and they occurred in
| Very Good, Premium, and Ideal cuts. We created a vector, myd, containing the indices
| of these datapoints. Look at these entries in diamonds by typing the expression
| diamonds[myd,]. The myd tells R what rows to show and the empty column entry says to
| print all the columns.

diamonds[myd,]

# A tibble: 12 x 11
carat cut       color clarity depth table price     x     y     z car2
<dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>
1   0.2 Premium   E     SI2      60.2    62   345  3.79  3.75  2.27 NA
2   0.2 Premium   E     VS2      59.8    62   367  3.79  3.77  2.26 NA
3   0.2 Premium   E     VS2      59      60   367  3.81  3.78  2.24 NA
4   0.2 Premium   E     VS2      61.1    59   367  3.81  3.78  2.32 NA
5   0.2 Premium   E     VS2      59.7    62   367  3.84  3.8   2.28 NA
6   0.2 Ideal     E     VS2      59.7    55   367  3.86  3.84  2.3  NA
7   0.2 Premium   F     VS2      62.6    59   367  3.73  3.71  2.33 NA
8   0.2 Ideal     D     VS2      61.5    57   367  3.81  3.77  2.33 NA
9   0.2 Very Good E     VS2      63.4    59   367  3.74  3.71  2.36 NA
10   0.2 Ideal     E     VS2      62.2    57   367  3.76  3.73  2.33 NA
11   0.2 Premium   D     VS2      62.3    60   367  3.73  3.68  2.31 NA
12   0.2 Premium   D     VS2      61.7    60   367  3.77  3.72  2.31 NA

| You're the best!

|==================================================================== | 89%
| We see these entries match the plots. Whew - that's a relief. The car2 field is, in
| fact, NA for these entries, but the carat field shows they each had a carat size of
| .2. What's going on here?

...

|====================================================================== | 91%
| Actually our plot answers this question. The boundaries for each column appear in
| the gray labels at the top of each column, and we see that the first column is
| labeled (0.2,0.5]. This indicates that this column contains data greater than .2 and
| less than or equal to .5. So diamonds with carat size .2 were excluded from the car2
| field.
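
A small base R sketch of this boundary behaviour, using a made-up `carat` vector and the cutpoints printed earlier in the lesson. The `include.lowest = TRUE` variant is not part of the lesson; it is shown only as one possible way to keep the minimum value in the first bin.

```r
# Toy carat values (illustrative, not the diamonds data)
carat <- c(0.2, 0.2, 0.3, 0.5, 0.7, 1.0, 1.5, 5.01)

# Probabilities used in the lesson: four equally spaced points from 0 to 1
probs <- seq(0, 1, length = 4)

# Cutpoints as printed in the lesson for diamonds$carat
cutpoints <- c(0.2, 0.5, 1.0, 5.01)

# Default intervals are open on the left, so 0.2 falls outside (0.2,0.5]
car2_default <- cut(carat, cutpoints)

# include.lowest = TRUE closes the first interval on the left
car2_closed <- cut(carat, cutpoints, include.lowest = TRUE)

print(table(car2_default, useNA = "ifany"))
print(table(car2_closed, useNA = "ifany"))
```

With the default call the two 0.2 values come back as NA, which mirrors the extra NA column in the facet plot; with the closed first interval no value is left unlabeled.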

...

|======================================================================= | 93%
| Finally, recall the last plotting command
| (g+geom_point(alpha=1/3)+facet_grid(cut~car2)) or retype it if you like and add
| another call. This one to the function geom_smooth. Pass it 3 arguments, method set
| equal to the string "lm", size set equal to 3, and color equal to the string "pink".

play()

| Entering play mode. Experiment as you please, then type nxt() when you are ready to
| resume the lesson.

diamonds[is.na(diamonds$car2),]

# A tibble: 12 x 11
carat cut       color clarity depth table price     x     y     z car2
<dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>
1   0.2 Premium   E     SI2      60.2    62   345  3.79  3.75  2.27 NA
2   0.2 Premium   E     VS2      59.8    62   367  3.79  3.77  2.26 NA
3   0.2 Premium   E     VS2      59      60   367  3.81  3.78  2.24 NA
4   0.2 Premium   E     VS2      61.1    59   367  3.81  3.78  2.32 NA
5   0.2 Premium   E     VS2      59.7    62   367  3.84  3.8   2.28 NA
6   0.2 Ideal     E     VS2      59.7    55   367  3.86  3.84  2.3  NA
7   0.2 Premium   F     VS2      62.6    59   367  3.73  3.71  2.33 NA
8   0.2 Ideal     D     VS2      61.5    57   367  3.81  3.77  2.33 NA
9   0.2 Very Good E     VS2      63.4    59   367  3.74  3.71  2.36 NA
10   0.2 Ideal     E     VS2      62.2    57   367  3.76  3.73  2.33 NA
11   0.2 Premium   D     VS2      62.3    60   367  3.73  3.68  2.31 NA
12   0.2 Premium   D     VS2      61.7    60   367  3.77  3.72  2.31 NA

nxt()

| Resuming lesson...

| Finally, recall the last plotting command
| (g+geom_point(alpha=1/3)+facet_grid(cut~car2)) or retype it if you like and add
| another call. This one to the function geom_smooth. Pass it 3 arguments, method set
| equal to the string "lm", size set equal to 3, and color equal to the string "pink".

g+geom_point(alpha=1/3)+facet_grid(cut~car2)+geom_smooth(method="lm",size=3,color="pink")
geom_smooth() using formula 'y ~ x'

| Keep up the great work!

|========================================================================= | 94%
| Nice thick regression lines which are somewhat interesting. You can add labels to
| the plot if you want but we'll let you experiment on your own.

...

|========================================================================== | 96%
| Lastly, ggplot2 can, of course, produce boxplots. This final exercise is the sum of
| 3 function calls. The first call is to ggplot with 2 arguments, diamonds and a call
| to aes with carat and price as arguments. The second call is to geom_boxplot with no
| arguments. The third is to facet_grid with one argument, the formula . ~ cut. Try
| this now.

ggplot(diamonds,aes(carat,price))+geom_boxplot()+facet_grid(.~cut)
Warning message:
Continuous y aesthetic -- did you forget aes(group=...)?

| Perseverance, that's the answer.

|============================================================================ | 98%
| Yes! A boxplot looking like marshmallows about to be roasted. Well done and
| congratulations! You've finished this jewel of a lesson. Hope it paid off!

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 1
What is your email address? [email protected]
What is your assignment token? xXxXxxXXxXxxXXXx
Grade submission succeeded!

| Excellent job!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-05-09 22:04:44.535875 IST

# GGPlot2 Part2

library(swirl)
swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

 1: Principles of Analytic Graphs   2: Exploratory Graphs        3: Graphics Devices in R
 4: Plotting Systems                5: Base Plotting System      6: Lattice Plotting System
 7: Working with Colors             8: GGPlot2 Part1             9: GGPlot2 Part2
10: GGPlot2 Extras                 11: Hierarchical Clustering  12: K Means Clustering
13: Dimension Reduction            14: Clustering Example       15: CaseStudy

Selection: 9

| Attempting to load lesson dependencies...

| Package ‘ggplot2’ loaded correctly!

| | 0%

| GGPlot2_Part2. (Slides for this and other Data Science courses may be found at github
| https://github.com/DataScienceSpecialization/courses/. If you care to use them, they
| must be downloaded as a zip file and viewed locally. This lesson corresponds to
| 04_ExploratoryAnalysis/ggplot2.)

...

|== | 2%
| In a previous lesson we showed you the vast capabilities of qplot, the basic
| workhorse function of the ggplot2 package. In this lesson we'll focus on some
| fundamental components of the package. These underlie qplot which uses default values
| when it calls them. If you understand these building blocks, you will be better able
| to customize your plots. We'll use the second workhorse function in the package,
| ggplot, as well as other graphing functions.

...

|=== | 4%
| Do you remember what the gg of ggplot2 stands for?

1: grammar of graphics
2: good grief
3: great graphics
4: goto graphics

Selection: 1

| That's the answer I was looking for.

|===== | 6%
| A "grammar" of graphics means that ggplot2 contains building blocks with which you
| can create your own graphical objects. What are these basic components of ggplot2
| plots? There are 7 of them.

...

|====== | 8%
| Obviously, there's a DATA FRAME which contains the data you're trying to plot. Then
| the AESTHETIC MAPPINGS determine how data are mapped to color, size, etc. The GEOMS
| (geometric objects) are what you see in the plot (points, lines, shapes) and FACETS
| are the panels used in conditional plots. You've used these or seen them used in the
| first ggplot2 (qplot) lesson.

...

|======== | 10%
| There are 3 more. STATS are statistical transformations such as binning, quantiles,
| and smoothing which ggplot2 applies to the data. SCALES show what coding an aesthetic
| map uses (for example, male = red, female = blue). Finally, the plots are depicted on
| a COORDINATE SYSTEM. When you use qplot these were taken care of for you.

...

|========== | 12%
| Do you remember what the "artist's palette" model means in the context of plotting?

1: we draw pictures
2: we mix paints
3: plots are built up in layers
4: things get messy

Selection: 3

| You nailed it! Good job!

|=========== | 15%
| As in the base plotting system (and in contrast to the lattice system), when building
| plots with ggplot2, the plots are built up in layers, maybe in several steps. You can
| plot the data, then overlay a summary (for instance, a regression line or smoother)
| and then add any metadata and annotations you need.

...

|============= | 17%
| We'll keep using the mpg data that comes with the ggplot2 package. Recall the
| versatility of qplot. Just as a refresher, call qplot now with 5 arguments. The first
| 3 deal with data - displ, hwy, and data=mpg. The fourth is geom set equal to the
| concatenation of the two strings, "point" and "smooth". The fifth is facets set equal
| to the formula .~drv. Try this now.

qplot(displ,hwy,data=mpg,geom=c("point","smooth"),facets=.~drv)
geom_smooth() using method = 'loess' and formula 'y ~ x'

| You got it!

|=============== | 19%
| We see a 3 facet plot, one for each drive type (4, f, and r). Now we'll see how
| ggplot works. We'll build up a similar plot using the basic components of the
| package. We'll do this in a series of steps.

...

|================ | 21%
| First we'll create a variable g by assigning to it the output of a call to ggplot
| with 2 arguments. The first is mpg (our dataset) and the second will tell ggplot what
| we want to plot, in this case, displ and hwy. These are what we want our aesthetics
| to represent so we enclose these as two arguments to the function aes. Try this now.

g<-ggplot(mpg,aes(displ,hwy))

| You are quite good my friend!

|================== | 23%
| Notice that nothing happened? As in the lattice system, ggplot created a graphical
| object which we assigned to the variable g.

...

|==================== | 25%
| Run the R command summary with g as its argument to see what g contains.

summary(g)

data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class [234x11]
mapping: x = ~displ, y = ~hwy
faceting: <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet, gg>

| You are quite good my friend!

|===================== | 27%
| So g contains the mpg data with all its named components in a 234 by 11 matrix. It
| also contains a mapping, x (displ) and y (hwy) which you specified, and no faceting.

...

|======================= | 29%
| Note that if you tried to print g with the expressions g or print(g) you'd get an
| error! Even though it's a great package, ggplot doesn't know how to display the data
| yet since you didn't specify how you wanted to see it. Now type g+geom_point() and
| see what happens.

g+geom_point()

| You are quite good my friend!

|======================== | 31%
| By calling the function geom_point you added a layer. By not assigning the expression
| to a variable you displayed a plot. Notice that you didn't have to pass any arguments
| to the function geom_point. That's because the object g has all the data stored in
| it. (Remember you saw that when you ran summary on g before.) Now use the expression
| you just typed (g + geom_point()) and add to it another layer, a call to
| geom_smooth(). Notice the red message R gives you.

g+geom_point()+geom_smooth()
geom_smooth() using method = 'loess' and formula 'y ~ x'

| You got it!

|========================== | 33%
| The gray shadow around the blue line is the confidence band. See how wide it is at
| the right? Let's try a different smoothing function. Use the up arrow to recover the
| expression you just typed, and instead of calling geom_smooth with no arguments, call
| it with the argument method set equal to the string "lm".

g+geom_point()+geom_smooth(method="lm")
geom_smooth() using formula 'y ~ x'

| Excellent work!

|============================ | 35%
| By changing the smoothing function to "lm" (linear model) ggplot2 generated a
| regression line through the data. Now recall the expression you just used and add to
| it another call, this time to the function facet_grid. Use the formula . ~ drv as its
| argument. Note that this is the same type of formula used in the calls to qplot.

g+geom_point()+geom_smooth(method="lm")+facet_grid(.~drv)
geom_smooth() using formula 'y ~ x'

| Your dedication is inspiring!

|============================= | 38%
| Notice how each panel is labeled with the appropriate factor. All the data associated
| with 4-wheel drive cars is in the leftmost panel, front-wheel drive data is shown in
| the middle panel, and rear-wheel drive data in the rightmost. Notice that this is
| similar to the plot you created at the start of the lesson using qplot. (We used a
| different smoothing function than previously.)

...

|=============================== | 40%
| So far you've just used the default labels that ggplot provides. You can add your own
| annotation using functions such as xlab(), ylab(), and ggtitle(). In addition, the
| function labs() is more general and can be used to label either or both axes as well
| as provide a title. Now recall the expression you just typed and add a call to the
| function ggtitle with the argument "Swirl Rules!".

g+geom_point()+geom_smooth(method="lm")+facet_grid(.~drv)+ggtitle("Swirl Rules!")
geom_smooth() using formula 'y ~ x'

| You are doing so well!

|================================ | 42%
| Now that you've seen the basics we'll talk about customizing. Each of the “geom”
| functions (e.g., _point and _smooth) has options to modify it. Also, the function
| theme() can be used to modify aspects of the entire plot, e.g. the position of the
| legend. Two standard appearance themes are included in ggplot. These are theme_gray()
| which is the default theme (gray background with white grid lines) and theme_bw()
| which is a plainer (black and white) color scheme.

...

|================================== | 44%
| Let's practice modifying aesthetics now. We'll use the graphic object g that we
| already filled with mpg data and add a call to the function geom_point, but this time
| we'll give geom_point 3 arguments. Set the argument color equal to "pink", the
| argument size to 4, and the argument alpha to 1/2. Notice that all the arguments are
| set equal to constants.

g+geom_point(color="pink",size=4,alpha=0.5)

| You are doing so well!

|==================================== | 46%
| Notice the different shades of pink? That's the result of the alpha aesthetic which
| you set to 1/2. This aesthetic tells ggplot how transparent the points should be.
| Darker circles indicate values hit by multiple data points.

...

|===================================== | 48%
| Now we'll modify the aesthetics so that color indicates which drv type each point
| represents. Again, use g and add to it a call to the function geom_point with 3
| arguments. The first is size set equal to 4, the second is alpha equal to 1/2. The
| third is a call to the function aes with the argument color set equal to drv. Note
| that you MUST use the function aes since the color of the points is data dependent
| and not a constant as it was in the previous example.

g+geom_point(size=4,alpha=0.5,aes(color=drv))

| That's a job well done!

|======================================= | 50%
| Notice the helpful legend on the right decoding the relationship between color and
| drv.

...

|========================================= | 52%
| Now we'll practice modifying labels. Again, we'll use g and add to it calls to 3
| functions. First, add a call to geom_point with an argument making the color
| dependent on the drv type (as we did in the previous example). Second, add a call to
| the function labs with the argument title set equal to "Swirl Rules!". Finally, add a
| call to labs with 2 arguments, one setting x equal to "Displacement" and the other
| setting y equal to "Hwy Mileage".

g+geom_point(aes(color=drv))+labs(title="Swirl Rules!")+labs(x="Displacement",y="Hwy Mileage")

| You are amazing!

|========================================== | 54%
| Note that you could have combined the two calls to the function labs in the previous
| example. Now we'll practice customizing the geom_smooth calls. Use g and add to it a
| call to geom_point setting the color to drv type (remember to use the call to the aes
| function), size set to 2 and alpha to 1/2. Then add a call to geom_smooth with 4
| arguments. Set size equal to 4, linetype to 3, method to "lm", and se to FALSE.

g+geom_point(aes(color=drv),size=2,alpha=0.5)+geom_smooth(size=4,linetype=3,method="lm",se=FALSE)
geom_smooth() using formula 'y ~ x'

| Perseverance, that's the answer.

|============================================ | 56%
| What did these arguments do? The method specified a linear regression (note the
| negative slope indicating that the bigger the displacement the lower the gas
| mileage), the linetype specified that it should be dashed (not continuous), the size
| made the dashes big, and the se flag told ggplot to turn off the gray shadows
| indicating standard errors (confidence intervals).

...

|============================================= | 58%
| Finally, let's do a simple plot using the black and white theme, theme_bw. Specify g
| and add a call to the function geom_point with the argument setting the color to the
| drv type. Then add a call to the function theme_bw with the argument base_family set
| equal to "Times". See if you notice the difference.

g+geom_point(aes(color=drv))+theme_bw(base_family = "Times")
There were 13 warnings (use warnings() to see them)

| Nice work!

|=============================================== | 60%
| No more gray background! Also, if you have good eyesight, you'll notice that the font
| in the labels changed.

...

|================================================= | 62%
| One final note before we go through a more complicated, layered ggplot example, and
| this concerns the limits of the axes. We're pointing this out to emphasize a subtle
| difference between ggplot and the base plotting function plot.

...

|================================================== | 65%
| We've created some random x and y data, called myx and myy, components of a dataframe
| called testdat. These represent 100 random normal points, except halfway through, we
| made one of the points be an outlier. That is, we set its y-value to be out of range
| of the other points. Use the base plotting function plot to create a line plot of
| this data. Call it with 4 arguments - myx, myy, type="l", and ylim=c(-3,3). The
| type="l" tells plot you want to display the data as a line instead of as a
| scatterplot.

warning messages from top-level task callback 'mini'
There were 40 warnings (use warnings() to see them)

play()

| Entering play mode. Experiment as you please, then type nxt() when you are ready to
| resume the lesson.

g+geom_point(aes(color=drv))+theme_dark()
g+geom_point(aes(color=drv))+theme_minimal()
g+geom_point(aes(color=drv))+theme_grey()

nxt()

| Resuming lesson...

| We've created some random x and y data, called myx and myy, components of a dataframe
| called testdat. These represent 100 random normal points, except halfway through, we
| made one of the points be an outlier. That is, we set its y-value to be out of range
| of the other points. Use the base plotting function plot to create a line plot of
| this data. Call it with 4 arguments - myx, myy, type="l", and ylim=c(-3,3). The
| type="l" tells plot you want to display the data as a line instead of as a
| scatterplot.

plot(myx,myy,type="l",ylim=c(-3,3))

| You got it!

|==================================================== | 67%
| Notice how plot plotted the points in the (-3,3) range for y-values. The outlier at
| (50,100) is NOT shown on the line plot. Now we'll plot the same data with ggplot.
| Recall that the name of the dataframe is testdat. Create the graphical object g with
| a call to ggplot with 2 arguments, testdat (the data) and a call to aes with 2
| arguments, x set equal to myx, and y set equal to myy.

g<-ggplot(data=testdat,aes(x=myx,y=myy))

| You got it!

|====================================================== | 69%
| Now add a call to geom_line with 0 arguments to g.

g+geom_line()

| You got it right!

|======================================================= | 71%
| Notice how ggplot DID display the outlier point at (50,100). As a result the rest of
| the data is smashed down so you don't get to see what the bulk of it looks like. The
| single outlier probably isn't important enough to dominate the graph. How do we get
| ggplot to behave more like plot in a situation like this?

...

|========================================================= | 73%
| Let's take a guess that in addition to adding geom_line() to g we also just have to
| add ylim(-3,3) to it as we did with the call to plot. Try this now to see what
| happens.

g+geom_line()+ylim(-3,3)

| Perseverance, that's the answer.

|========================================================== | 75%
| Notice that by doing this, ggplot simply ignored the outlier point at (50,100).
| There's a break in the line which isn't very noticeable. Now recall that at the
| beginning of the lesson we mentioned 7 components of a ggplot plot, one of which was
| a coordinate system. This is a situation where using a coordinate system would be
| helpful. Instead of adding ylim(-3,3) to the expression g+geom_line(), add a call to
| the function coord_cartesian with the argument ylim set equal to c(-3,3).

g+geom_line()+coord_cartesian(ylim=c(-3,3))

| You are really on a roll!

|============================================================ | 77%
| See the difference? This looks more like the plot produced by the base plot function.
| The outlier y value at x=50 is not shown, but the plot indicates that it is larger
| than 3.

...

|============================================================== | 79%
| We'll close with a more complicated example to show you the full power of ggplot and
| the entire ggplot2 package. We'll continue to work with the mpg dataset.

...

|=============================================================== | 81%
| Start by creating the graphical object g by assigning to it a call to ggplot with 2
| arguments. The first is the dataset and the second is a call to the function aes.
| This call will have 3 arguments, x set equal to displ, y set equal to hwy, and color
| set equal to factor(year). This last will allow us to distinguish between the two
| manufacturing years (1999 and 2008) in our data.

g<-ggplot(data=mpg,aes(x=displ,y=hwy,color=factor(year)))

| All that practice is paying off!

|================================================================= | 83%
| Uh oh! Nothing happened. Does g exist? Of course, it just isn't visible yet since you
| didn't add a layer.

...

|=================================================================== | 85%
| If you typed g at the command line, what would happen?

1: a scatterplot would appear with 2 colors of points
2: I would have to try this to answer the question
3: R would return an error in red

Selection: 3

| You got it!

|==================================================================== | 88%
| We'll build the plot up step by step. First add to g a call to the function
| geom_point with 0 arguments.

g+geom_point()

| You nailed it! Good job!

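
Looking back at the axis-limit steps above (ylim versus coord_cartesian on testdat), the sketch below rebuilds a stand-in for testdat and sets up both variants side by side. The seed and the random values are assumptions, since the lesson does not show how testdat was generated; only the shape (100 points, outlier at (50,100)) comes from the lesson text. The ggplot2 part is guarded so the base R part still runs where the package is not installed.

```r
# Hypothetical reconstruction of testdat; the lesson's actual seed and values are unknown
set.seed(1)
testdat <- data.frame(myx = 1:100, myy = rnorm(100))
testdat$myy[50] <- 100   # the outlier at (50,100)

# Rows whose y-value lies outside the (-3,3) window
outside <- testdat[abs(testdat$myy) > 3, ]
print(outside)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(testdat, aes(x = myx, y = myy))
  # ylim() sets scale limits: out-of-range values are treated as missing,
  # which is why the line shows a break at x = 50
  p_scale <- g + geom_line() + ylim(-3, 3)
  # coord_cartesian() only zooms the view: all data are kept,
  # so the line runs off the top of the panel at x = 50
  p_zoom <- g + geom_line() + coord_cartesian(ylim = c(-3, 3))
  # print(p_scale); print(p_zoom)   # uncomment to draw both plots
}
```
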
|====================================================================== | 90%
| A simple, yet comfortingly familiar scatterplot appears. Let's make our display a 2
| dimensional multi-panel plot. Recall your last command (with the up arrow) and add to
| it a call to the function facet_grid. Give it 2 arguments. The first is the formula
| drv~cyl, and the second is the argument margins set equal to TRUE. Try this now.

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)

| Keep up the great work!

|======================================================================== | 92%
| A 4 by 5 plot, huh? The margins argument tells ggplot to display the marginal totals
| over each row and column, so instead of seeing 3 rows (the number of drv factors) and
| 4 columns (the number of cyl factors) we see a 4 by 5 display. Note that the panel in
| position (4,5) is a tiny version of the scatterplot of the entire dataset.

...

|========================================================================= | 94%
| Now add to your last command (or retype it if you like to type) a call to geom_smooth
| with 4 arguments. These are method set to "lm", se set to FALSE, size set to 2, and
| color set to "black".

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method="lm",se=FALSE,size=2,color="black")
geom_smooth() using formula 'y ~ x'

| Keep up the great work!

|=========================================================================== | 96%
| Angry Birds? Finally, add to your last command (or retype it if you like to type) a
| call to the function labs with 3 arguments. These are x set to "Displacement", y set
| to "Highway Mileage", and title set to "Swirl Rules!".

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method="lm",se=FALSE,size=2,color="black")+labs(x="Displacement",y="Highway Mileage",title="Swirl Rules!")
geom_smooth() using formula 'y ~ x'

| Keep working like that and you'll get there!

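
The 4 by 5 layout produced by margins=TRUE above can be mimicked with a base R contingency table. This is only an analogy, not ggplot2's own code path: the `combos` data frame is a made-up grid of the 3 drive types and 4 cylinder counts the lesson mentions, and `addmargins` plays the role of the extra "all" row and column.

```r
# Levels as described in the lesson: 3 drive types, 4 cylinder counts
combos <- expand.grid(drv = c("4", "f", "r"), cyl = c(4, 5, 6, 8))

counts <- table(combos$drv, combos$cyl)   # 3 rows by 4 columns
with_margins <- addmargins(counts)        # adds a "Sum" row and a "Sum" column

print(dim(counts))
print(dim(with_margins))
```
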
|============================================================================ | 98%
| You could have done these labels with separate calls to labs but we thought you'd be
| sick of this by now. Anyway, congrats! You've concluded part 2 of ggplot2. We hope
| you got enough mileage out of the lesson. If you like ggplot2 you can do some extras
| with the extra lesson.

...

|==============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 1
What is your email address? [email protected]
What is your assignment token? xXxXxxXXxXxxXXXx
Grade submission succeeded!

| You got it right!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method="lm",se=FALSE,size=2,color="black")+labs(x="Displacement",y="Highway Mileage",title="Swirl Rules!")+theme(plot.title = element_text(hjust = 0.5))
geom_smooth() using formula 'y ~ x'

rm(list=ls())

Last updated 2020-05-08 21:23:15.085181 IST

# GGPlot2 Part1

library(swirl)
library(ggplot2)
swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

1: Principles of Analytic Graphs 2: Exploratory Graphs 3: Graphics Devices in R 4: Plotting Systems 5: Base Plotting System 6: Lattice Plotting System 7: Working with Colors 8: GGPlot2 Part1 9: GGPlot2 Part2 10: GGPlot2 Extras 11: Hierarchical Clustering 12: K Means Clustering 13: Dimension Reduction 14: Clustering Example 15: CaseStudy Selection: 8 | Attempting to load lesson dependencies... | Package ‘ggplot2’ loaded correctly! | | 0% | GGPlot2_Part1. (Slides for this and other Data Science courses may be found at | github https://github.com/DataScienceSpecialization/courses/. If you care to use | them, they must be downloaded as a zip file and viewed locally. This lesson | corresponds to 04_ExploratoryAnalysis/ggplot2.) ... |== | 2% | In another lesson, we gave you an overview of the three plotting systems in R. In | this lesson we'll focus on the third and newest plotting system in R, ggplot2. As | we did with the other two systems, we'll focus on creating graphics on the screen | device rather than another graphics device. ... |==== | 5% | The ggplot2 package is an add-on package available from CRAN via install.packages(). | (Don't worry, we've installed it for you already.) It is an implementation of The | Grammar of Graphics, an abstract concept (as well as book) authored and invented by | Leland Wilkinson and implemented by Hadley Wickham while he was a graduate student | at Iowa State. The web site http://ggplot2.org provides complete documentation. ... |====== | 7% | A grammar of graphics represents an abstraction of graphics, that is, a theory of | graphics which conceptualizes basic pieces from which you can build new graphics and | graphical objects. The goal of the grammar is to “Shorten the distance from mind to | page”. From Hadley Wickham's book we learn that ... |======== | 10% | The ggplot2 package "is composed of a set of independent components that can be | composed in many different ways. ... 
you can create new graphics that are precisely | tailored for your problem." These components include aesthetics which are attributes | such as colour, shape, and size, and geometric objects or geoms such as points, | lines, and bars. ... |========= | 12% | Before we delve into details, let's review the other 2 plotting systems. ... |=========== | 15% | Recall what you know about R's base plotting system. Which of the following does NOT | apply to it? 1: It is convenient and mirrors how we think of building plots and analyzing data 2: Can easily go back once the plot has started (e.g., to adjust margins or correct a typo) 3: Use annotation functions to add/modify (text, lines, points, axis) 4: Start with plot (or similar) function Selection: 2 | That's correct! |============= | 17% | Recall what you know about R's lattice plotting system. Which of the following does | NOT apply to it? 1: Margins and spacing are set automatically because entire plot is specified at once 2: Most useful for conditioning types of plots and putting many panels on one plot 3: Can always add to the plot once it is created 4: Plots are created with a single function call (xyplot, bwplot, etc.) Selection: 3 | Excellent job! |=============== | 20% | If we told you that ggplot2 combines the best of base and lattice, that would mean | it ...? 1: Automatically deals with spacings, text, titles but also allows you to annotate 2: Its default mode makes many choices for you (but you can customize!) 3: All of the others 4: Like lattice it allows for multipanels but more easily and intuitively Selection: 3 | You are quite good my friend! |================= | 22% | Yes, ggplot2 combines the best of base and lattice. It allows for multipanel | (conditioning) plots (as lattice does) but also post facto annotation (as base | does), so you can add titles and labels. It uses the low-level grid package (which | comes with R) to draw the graphics. 
As part of its grammar philosophy, ggplot2 plots | are composed of aesthetics (attributes such as size, shape, color) and geoms | (points, lines, and bars), the geometric objects you see on the plot. ... |=================== | 24% | The ggplot2 package has 2 workhorse functions. The more basic workhorse function is | qplot, (think quick plot), which works like the plot function in the base graphics | system. It can produce many types of plots (scatter, histograms, box and whisker) | while hiding tedious details from the user. Similar to lattice functions, it looks | for data in a data frame or parent environment. ... |===================== | 27% | The more advanced workhorse function in the package is ggplot, which is more | flexible and can be customized for doing things qplot cannot do. In this lesson | we'll focus on qplot. ... |======================= | 29% | We'll start by showing how easy and versatile qplot is. First, let's look at some | data which comes with the ggplot2 package. The mpg data frame contains fuel economy | data for 38 models of cars manufactured in 1999 and 2008. Run the R command str with | the argument mpg. This will give you an idea of what mpg contains. str(mpg) tibble [234 x 11] (S3: tbl_df/tbl/data.frame)$ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
$model : chr [1:234] "a4" "a4" "a4" "a4" ...$ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...$ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
$trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...$ drv         : chr [1:234] "f" "f" "f" "f" ...
$cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...$ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
$fl : chr [1:234] "p" "p" "p" "p" ...$ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

| You are really on a roll!

|======================== | 32%
| We see that there are 234 points in the dataset concerning 11 different
| characteristics of the cars. Suppose we want to see if there's a correlation between
| engine displacement (displ) and highway miles per gallon (hwy). As we did with the
| plot function of the base system we could simply call qplot with 3 arguments, the
| first two are the variables we want to examine and the third argument data is set
| equal to the name of the dataset which contains them (in this case, mpg). Try this
| now.

qplot(displ,hwy,data=mpg)

| You are amazing!

|========================== | 34%
| A nice scatterplot done simply, right? All the labels are provided. The first
| argument is shown along the x-axis and the second along the y-axis. The negative
| trend (increasing displacement and lower gas mileage) is pretty clear. Now suppose
| we want to do the same plot but this time use different colors to distinguish
| between the 3 factors (subsets) of different types of drive (drv) in the data
| (front-wheel, rear-wheel, and 4-wheel). Again, qplot makes this very easy. We'll
| just add what ggplot2 calls an aesthetic, a fourth argument, color, and set it equal
| to drv. Try this now. (Use the up arrow key to save some typing.)

qplot(displ,hwy,data=mpg,color=drv)

| All that hard work is paying off!
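
For comparison, the same colored scatterplot can be sketched with the layered ggplot() interface that the lesson mentions as the more advanced workhorse. This is only an illustration, not part of the swirl lesson; the object name p is an assumption, and as far as I know newer ggplot2 releases (3.4.0 onward) mark qplot() as deprecated in favor of this style.

```r
library(ggplot2)

# Same plot as qplot(displ, hwy, data = mpg, color = drv):
# the data and aesthetics go to ggplot(), the points are added as a geom layer.
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

print(p)
```

The aes() call plays the role of qplot's first, second, and color arguments; geom_point() is the default geom that qplot picked automatically.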

|============================ | 37%
| Pretty cool, right? See the legend to the right which qplot helpfully supplied? The
| colors were automatically assigned by qplot so the legend decodes the colors for
| you. Notice that qplot automatically used dots or points to indicate the data. These
| points are geoms (geometric objects). We could have used a different aesthetic, for
| instance shape instead of color, to distinguish between the drive types.

...

|============================== | 39%
| Now let's add a second geom to the default points. How about some smoothing function
| to produce trend lines, one for each color? Just add a fifth argument, geom, and
| using the R function c(), set it equal to the concatenation of the two strings
| "point" and "smooth". The first refers to the data points and second to the trend
| lines we want plotted. Try this now.

qplot(displ,hwy,data=mpg,color=drv,geom=c("point","smooth"))
geom_smooth() using method = 'loess' and formula 'y ~ x'

| That's correct!

|================================ | 41%
| Notice the gray areas surrounding each trend line. These indicate the 95%
| confidence intervals for the lines.

...
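
A hedged sketch of the same point-plus-smooth plot in ggplot() form follows. The arguments se and level are real geom_smooth parameters (their defaults are TRUE and 0.95); spelling them out here is only meant to show where the gray bands come from. The name p_smooth is illustrative.

```r
library(ggplot2)

# Points plus one loess trend line per drive type.
# se = TRUE draws the gray band; level = 0.95 sets its confidence level.
p_smooth <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, level = 0.95)

print(p_smooth)
```

Setting se = FALSE would keep the trend lines and drop the bands.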

|================================== | 44%
| Before we leave qplot's scatterplotting ability, call qplot again, this time with 3
| arguments. The first is y set equal to hwy, the second is data set equal to mpg, and
| the third is color set equal to drv. Try this now.

qplot(y=hwy,data=mpg,color=drv)

| Great job!

|==================================== | 46%
| What's this plot showing? We see the x-axis ranges from 0 to 250 and we remember
| that we had 234 data points in our set, so we can infer that each point in the plot
| represents one of the hwy values (indicated by the y-axis). We've created the vector
| myhigh for you which contains the hwy data from the mpg dataset. Look at myhigh now.

play()

| Entering play mode. Experiment as you please, then type nxt() when you are ready to
| resume the lesson.

qplot(y=hwy,data=mpg)

nxt()

| Resuming lesson...

| What's this plot showing? We see the x-axis ranges from 0 to 250 and we remember
| that we had 234 data points in our set, so we can infer that each point in the plot
| represents one of the hwy values (indicated by the y-axis). We've created the vector
| myhigh for you which contains the hwy data from the mpg dataset. Look at myhigh now.

myhigh

  [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23 26 25
 [28] 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18 17 17 19 19
 [55] 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17 17 18 17 19 17 19
 [82] 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33 32 32 29 32 34 36 36 29
[109] 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20 17 12 19 18 14 15 18 18 15 17
[136] 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25 17 17 20 18 26 26 27 28 25 25 24 27
[163] 25 26 23 26 26 26 26 25 27 25 27 20 20 19 17 20 17 29 27 31 31 26 26 28 27 29 31
[190] 31 26 26 27 30 33 35 37 35 15 18 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29
[217] 29 29 29 23 24 44 41 29 26 28 29 29 29 28 29 26 26 26

| You got it!

|====================================== | 49%
| Comparing the values of myhigh with the plot, we see the first entries in the vector
| (29, 29, 31, 30,...) correspond to the leftmost points in the plot (in order),
| and the last entries in myhigh (28, 29, 26, 26, 26) correspond to the rightmost
| plotted points. So, specifying the y parameter only, without an x argument, plots
| the values of the y argument in the order in which they occur in the data.

...
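
The behavior described above can be reproduced by hand: plotting hwy against its own position in the data gives the same picture. This is a small sketch, not swirl's code; hwy_values, index, and p_index are hypothetical names, and hwy_values stands in for the lesson's myhigh vector.

```r
library(ggplot2)

# The hwy column in its original order (the lesson calls this vector myhigh).
hwy_values <- mpg$hwy

# Position of each observation in the data: 1, 2, ..., 234.
index <- seq_along(hwy_values)

# Explicit version of qplot(y = hwy, data = mpg): x is just the row position.
p_index <- ggplot(data.frame(index = index, hwy = hwy_values),
                  aes(x = index, y = hwy)) +
  geom_point()

print(p_index)
```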

|======================================= | 51%
| The all-purpose qplot can also create box and whisker plots. Call qplot now with 4
| arguments. First specify the variable by which you'll split the data, in this case
| drv, then specify the variable which you want to examine, in this case hwy. The
| third argument is data (set equal to mpg), and the fourth, the geom, set equal to
| the string "boxplot".

qplot(drv,hwy,data=mpg,geom="boxplot")

|========================================= | 54%
| We see 3 boxes, one for each drive. Now to impress you, call qplot with 5 arguments.
| The first 4 are just as you used previously, (drv, hwy, data set equal to mpg, and
| geom set equal to the string "boxplot"). Now add a fifth argument, color, equal to
| manufacturer.

qplot(drv,hwy,data=mpg,geom="boxplot",color=manufacturer)

| You are amazing!

|=========================================== | 56%
| It's a little squished but we just wanted to illustrate qplot's capabilities. Notice
| that there are still 3 regions of the plot (determined by the factor drv). Each is
| subdivided into several boxes depicting different manufacturers.

...

|============================================= | 59%
| Now, on to histograms. These display frequency counts for a single variable. Let's
| start with an easy one. Call qplot with 3 arguments. First specify the variable for
| which you want the frequency count, in this case hwy, then specify the data (set
| equal to mpg), and finally, the aesthetic, fill, set equal to drv. Instead of a plain
| old histogram, this will again use colors to distinguish the 3 different drive
| factors.

qplot(hwy,data=mpg,fill=drv)
stat_bin() using bins = 30. Pick better value with binwidth.

|=============================================== | 61%
| See how qplot consistently uses the colors. Red (if 4-wheel drv is in the bin) is at
| the bottom of the bin, then green on top of it (if present), followed by blue (rear
| wheel drv). The color lets us see right away that 4-wheel drive vehicles in this
| dataset don't have gas mileages exceeding 30 miles per gallon.

...
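
The claim about 4-wheel drive mileage can also be checked numerically rather than read off the histogram. A minimal sketch, assuming ggplot2 is installed so that the mpg data frame is available; max_by_drv is a hypothetical name.

```r
library(ggplot2)  # provides the mpg data frame

# Largest highway mileage observed within each drive type (4, f, r).
max_by_drv <- tapply(mpg$hwy, mpg$drv, max)
print(max_by_drv)

# TRUE if no 4-wheel drive vehicle exceeds 30 miles per gallon.
print(max_by_drv[["4"]] <= 30)
```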

|================================================= | 63%
| It's cool that qplot can do this so easily, but some people may find this multi-color
| histogram hard to interpret. Instead of using colors to distinguish between the drive
| factors let's use facets or panels. (That's what lattice called them.) This just
| means we'll split the data into 3 subsets (according to drive) and make 3 smaller
| individual plots of each subset in one plot (and with one call to qplot).

...

|=================================================== | 66%
| Remember that with base plot we had to do each subplot individually. The lattice
| system made plotting conditioning plots easier. Let's see how easy it is with qplot.

...

|===================================================== | 68%
| We'll do two plots, a scatterplot and then a histogram, each with 3 facets. For the
| scatterplot, call qplot with 4 arguments. The first two are displ and hwy and the
| third is the argument data set equal to mpg. The fourth is the argument facets which
| will be set equal to the expression . ~ drv which is ggplot2's shorthand for number
| of rows (to the left of the ~) and number of columns (to the right of the ~). Here
| the . indicates a single row and drv implies 3, since there are 3 distinct drive
| factors. Try this now.

qplot(displ,hwy,data=mpg,facets=.~drv)

| Nice work!

|====================================================== | 71%
| The result is a 1 by 3 array of plots. Note how each is labeled at the top with the
| factor label (4, f, or r). This shows us more detailed information than the histogram.
| We see the relationship between displacement and highway mileage for each of the 3
| drive factors.

...

|======================================================== | 73%
| Now we'll do a histogram, again calling qplot with 4 arguments. This time, since we
| need only one variable for a histogram, the first is hwy and the second is the
| argument data set equal to mpg. The third is the argument facets which we'll set
| equal to the expression drv ~ . . This will give us a different arrangement of the
| facets. The fourth argument is binwidth. Set this equal to 2. Try this now.

qplot(hwy,data=mpg,facets=drv~.,binwidth=2)

| Keep working like that and you'll get there!

|========================================================== | 76%
| The facets argument, drv ~ ., resulted in what arrangement of facets?

1: 2 by 2
2: 1 by 3
3: 3 by 1
4: huh?

Selection: 3

| Nice work!
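
The two facet formulas can be compared side by side in ggplot() form. This is a sketch under the assumption that facet_grid accepts the same rows ~ columns formulas the lesson passes to qplot's facets argument; p_cols and p_rows are illustrative names.

```r
library(ggplot2)

# . ~ drv : one row, one column per drive type (a 1 by 3 layout).
p_cols <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ drv)

# drv ~ . : one row per drive type, a single column (a 3 by 1 layout).
p_rows <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_grid(drv ~ .)

print(p_cols)
print(p_rows)
```

Whatever sits to the left of the ~ determines the rows and whatever sits to the right determines the columns, which is why drv ~ . stacks the three histograms vertically.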

|============================================================ | 78%
| Pretty good, right? Not too difficult either. Let's review what we learned!

...

|============================================================== | 80%
| Which of the following is a basic workhorse function of ggplot2?

1: gplot
2: scatterplot
3: qplot
4: xyplot
5: hist

Selection: 3

| All that practice is paying off!

|================================================================ | 83%
| Which types of plot does qplot plot?

1: scatterplots
2: all of the others
3: histograms
4: box and whisker plots

Selection: 2

| Great job!

|================================================================== | 85%
| What does the gg in ggplot2 stand for?

1: goto graphics
2: grammar of graphics
3: good grief
4: good graphics

Selection: 2

|==================================================================== | 88%
| True or False? The geom argument takes a string for a value.

1: False
2: True

Selection: 2

| Nice work!

|===================================================================== | 90%
| True or False? The data argument takes a string for a value.

1: True
2: False

Selection: 2

| You're the best!

|======================================================================= | 93%
| True or False? The binwidth argument takes a string for a value.

1: False
2: True

Selection: 1

| Great job!

|========================================================================= | 95%
| True or False? The user must specify x- and y-axis labels when using qplot.

1: True
2: False

Selection: 2

| All that practice is paying off!

|=========================================================================== | 98%
| Congrats! You've finished plot 1 of ggplot2. In the next lesson the plot thickens.

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 1
What is your assignment token? xXxXxxXXxXxxXXXx

| That's a job well done!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-05-08 20:04:09.107833 IST

# Working with Colors

R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

library(ggplot2)
library(swirl)

| Hi! Type swirl() when you are ready to begin.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

 1: Principles of Analytic Graphs   2: Exploratory Graphs
 3: Graphics Devices in R           4: Plotting Systems
 5: Base Plotting System            6: Lattice Plotting System
 7: Working with Colors             8: GGPlot2 Part1
 9: GGPlot2 Part2                  10: GGPlot2 Extras
11: Hierarchical Clustering        12: K Means Clustering
13: Dimension Reduction            14: Clustering Example
15: CaseStudy

Selection: 7

| Attempting to load lesson dependencies...

| | 0%

| Working_with_Colors. (Slides for this and other Data Science courses may be found at
| github https://github.com/DataScienceSpecialization/courses/. If you care to use
| them, they must be downloaded as a zip file and viewed locally. This lesson
| corresponds to 04_ExploratoryAnalysis/Colors.)

...

|= | 1%
| This lesson is about using colors in R. It really supplements the lessons on
| plotting with the base and lattice packages which contain functions that are able to
| take the argument col. We'll discuss ways to set this argument more colorfully.

...

|== | 3%
| Of course, color choice is secondary to your data and how you analyze it, but
| effectively using colors can enhance your plots and presentations, emphasizing the
| important points you're trying to convey.

...

|=== | 4%
| The motivation for this lesson is that the default color schemes for most plots in R
| are not optimal. Fortunately there have been recent developments to improve the
| handling and specification of colors in plots and graphs. We'll cover some functions
| in R as well as in external packages that are very handy. If you know how to use
| some of these then you'll have more options when you create your displays.

...

|==== | 6%
| We'll begin with a motivating example - a typical R plot using 3 default colors.

...

|====== | 7%
| According to the plot, what is color 2?

1: Blue
2: Empty black circles
3: Red
4: Green

Selection: 3

| Nice work!

|======= | 9%
| So these are the first 3 default values. If you were plotting and just specified
| col=c(1:3) as one of your arguments, these are colors you'd get. Maybe you like
| them, but they might not be the best choice for your application.

...

|======== | 10%
| To show you some options, here's a display of two color palettes that come with the
| grDevices package available to you. The left shows you some colors from the function
| heat.colors. Here low values are represented in red and as the values increase the
| colors move through yellow towards white. This is consistent with the physical
| properties of fire. The right display is from the function topo.colors which uses
| topographical colors ranging from blue (low values) towards brown (higher values).

...

|========= | 12%
| So we'll first discuss some functions that the grDevices package offers. The
| function colors() lists the names of 657 predefined colors you can use in any
| plotting function. These names are returned as strings. Run the R command sample
| with colors() as its first argument and 10 as its second to give you an idea of the
| choices you have.

sample(colors(),10)
[1] "gray1" "darkorchid2" "blue3" "darkorchid3" "gray10"
[6] "firebrick1" "magenta3" "gray75" "lemonchiffon4" "rosybrown3"

| Great job!
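
A short way to explore the list beyond a random sample is to count it and search it by name. The sketch below uses only base R; the pattern "orchid" and the variable names are just examples.

```r
# colors() comes from grDevices, which is attached by default in R.
all_colors <- colors()

# Number of predefined color names.
n_colors <- length(all_colors)
print(n_colors)

# All names containing "orchid", including the numbered variants.
orchids <- grep("orchid", all_colors, value = TRUE)
print(orchids)
```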

|========== | 13%
| We see a lot of variety in the colors, some of which are names followed by numbers
| indicating that there are multiple forms of that particular color.

...

|=========== | 14%
| So you're free to use any of these 600+ colors listed by the colors function.
| However, two additional functions from grDevices, colorRamp and colorRampPalette,
| give you more options. Both of these take color names as arguments and use them as
| "palettes", that is, these argument colors are blended in different proportions to
| form new colors.

...

|============ | 16%
| The first, colorRamp, takes a palette of colors (the arguments) and returns a
| function that takes values between 0 and 1 as arguments. The 0 and 1 correspond to
| the extremes of the color palette. Arguments between 0 and 1 return blends of these
| extremes.

...

|============= | 17%
| Let's see what this means. Assign to the variable pal the output of a call to
| colorRamp with the single argument, c("red","blue").

pal<-colorRamp(c("red","blue"))

| You are amazing!

|=============== | 19%
| We don't see any output, but R has created the function pal which we can call with a
| single argument between 0 and 1. Call pal now with the argument 0.

pal(0)
     [,1] [,2] [,3]
[1,]  255    0    0

| You are quite good my friend!

|================ | 20%
| We see a 1 by 3 array with 255 as the first entry and 0 in the other 2. This 3-long
| vector corresponds to red, green, blue (RGB) color encoding commonly used in
| televisions and monitors. In R, 24 bits are used to represent colors. Think of these
| 24 bits as 3 sets of 8 bits, each of which represents an intensity for one of the
| colors red, green, and blue.

...

|================= | 22%
| The 255 returned from the pal(0) call corresponds to the largest possible number
| represented with 8 bits, so the vector (255,0,0) contains only red (no green or
| blue), and moreover, it's the highest possible value of red.

...
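
The 8-bit arithmetic above can be confirmed directly. In this sketch col2rgb is the grDevices function that decodes a color name into its red, green, and blue intensities; the variable names are illustrative.

```r
# Largest value that fits in 8 bits.
max_intensity <- 2^8 - 1
print(max_intensity)

# Decode the named color "red" into its three 8-bit channels.
red_rgb <- col2rgb("red")
print(red_rgb)
```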

|================== | 23%
| Given that you created pal with the palette containing "red" and "blue", what color
| do you think will be represented by the vector that pal(1) returns? Recall that pal
| will only take arguments between 0 and 1, so 1 is the largest argument you can pass
| it.

1: blue
2: red
3: green
4: yellow

Selection: 1

| Keep up the great work!

|=================== | 25%
| Check your answer now by calling pal with the argument 1.

pal(1)
     [,1] [,2] [,3]
[1,]    0    0  255

| Excellent work!

|==================== | 26%
| You see the vector (0,0,255) which represents the highest intensity of blue. What
| vector do you think the call pal(.5) will return?

1: (0,255,0)
2: (255,255,255)
3: (127.5,0,127.5)
4: (255,0,255)

Selection: 3

| You got it!

|===================== | 28%
| The function pal can take more than one argument. It returns one 3-long (or 4-long,
| but more about this later) vector for each argument. To see this in action, call pal
| with the argument seq(0,1,len=6).

pal(seq(0,1,len=6))

     [,1] [,2] [,3]
[1,]  255    0    0
[2,]  204    0   51
[3,]  153    0  102
[4,]  102    0  153
[5,]   51    0  204
[6,]    0    0  255

| Nice work!

|====================== | 29%
| Six vectors (each of length 3) are returned. The i-th vector is identical to output
| that would be returned by the call pal(i/5) for i=0,...5. We see that the i-th row
| (for i=1,...6) differs from the (i-1)-st row in the following way. Its red entry is
| 51 = 255/5 points lower and its blue entry is 51 points higher.

...
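
The row-to-row pattern can be verified with diff(), which subtracts each row's entry from the next. This is a small sketch using the same palette as the lesson; ramp_rows, red_steps, blue_steps, and midpoint are hypothetical names.

```r
# Rebuild the lesson's ramp between red and blue.
pal <- colorRamp(c("red", "blue"))

# Six evenly spaced positions between 0 and 1, one RGB row each.
ramp_rows <- pal(seq(0, 1, length.out = 6))

# Change in the red and blue columns from one row to the next.
red_steps <- diff(ramp_rows[, 1])
blue_steps <- diff(ramp_rows[, 3])
print(red_steps)
print(blue_steps)

# The halfway blend, which is the answer to the pal(.5) question.
midpoint <- pal(0.5)
print(midpoint)
```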

|======================= | 30%
| So pal creates colors using the palette we specified when we called colorRamp. In
| this example none of pal's outputs will ever contain green since it wasn't in our
| initial palette.

...

|========================= | 32%
| We'll turn now to colorRampPalette, a function similar to colorRamp. It also takes a
| palette of colors and returns a function. This function, however, takes integer
| arguments (instead of numbers between 0 and 1) and returns a vector of colors each
| of which is a blend of colors of the original palette.

...

|========================== | 33%
| The argument you pass to the returned function specifies the number of colors you
| want returned. Each element of the returned vector is a 24 bit number, represented
| as 6 hexadecimal characters, which range from 0 to F. This set of 6 hex characters
| represents the intensities of red, green, and blue, 2 characters for each color.

...

|=========================== | 35%
| To see this better, assign to the variable p1 the output of a call to
| colorRampPalette with the single argument, c("red","blue"). We'll compare it to our
| experiments using colorRamp.

p1<-colorRampPalette(c("red","blue"))

| You got it!

|============================ | 36%
| Now call p1 with the argument 2.

p1(2)
[1] "#FF0000" "#0000FF"

| All that hard work is paying off!

|============================= | 38%
| We see a 2-long vector is returned. The first entry FF0000 represents red. The FF is
| hexadecimal for 255, the same value returned by our call pal(0). The second entry
| 0000FF represents blue, also with intensity 255.

...

|============================== | 39%
| Now call p1 with the argument 6. Let's see if we get the same result as we did when
| we called pal with the argument seq(0,1,len=6).

p1(6)
[1] "#FF0000" "#CC0033" "#990066" "#650099" "#3200CC" "#0000FF"

| You are amazing!

|=============================== | 41%
| Now we get the 6-long vector (FF0000, CC0033, 990066, 650099, 3200CC, 0000FF). We
| see the two ends (FF0000 and 0000FF) are consistent with the colors red and blue.
| How about CC0033? Type 0xcc or 0xCC at the command line to see the decimal
| equivalent of this hex number. You must include the 0 before the x to specify that
| you're entering a hexadecimal number.

0xCC
[1] 204

| You are amazing!

|================================ | 42%
| So 0xCC equals 204 and we can easily convert hex 33 to decimal, as in
| 0x33=3*16+3=51. These were exactly the numbers we got in the second row returned
| from our call to pal(seq(0,1,len=6)). We see that 4 of the 6 numbers agree with our
| earlier call to pal. Two of the 6 differ slightly.

...
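
The hex conversions, and the small disagreement between p1 and pal, can be inspected with base R. In this sketch strtoi with base 16 parses a hex string and col2rgb decodes whole color strings. The lesson does not say why two entries differ; the code only measures the difference, and my guess (unverified) is that it comes from how fractional intensities are converted to whole numbers.

```r
# Convert the three hex pairs of "#CC0033" to decimal.
hex_parts <- c(red = "CC", green = "00", blue = "33")
decimal_parts <- strtoi(hex_parts, base = 16L)
names(decimal_parts) <- names(hex_parts)
print(decimal_parts)

# Compare the six colors from colorRampPalette with the six rows from colorRamp.
p1 <- colorRampPalette(c("red", "blue"))
pal <- colorRamp(c("red", "blue"))
from_palette <- t(col2rgb(p1(6)))           # one row per color, integer intensities
from_ramp <- pal(seq(0, 1, length.out = 6))  # one row per color, numeric intensities
difference <- from_palette - from_ramp
print(difference)
```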

|================================= | 43%
| We can also form palettes using colors other than red, green and blue. Form a
| palette, p2, by calling colorRampPalette with the colors "red" and "yellow".
| Remember to concatenate them into a single argument.

p2<-colorRampPalette(c("red","yellow"))

| You are really on a roll!

|=================================== | 45%
| Now call p2 with the argument 2. This will show us the two extremes of the blends of
| colors we'll get.

p2(2)
[1] "#FF0000" "#FFFF00"

| Excellent work!

|==================================== | 46%
| Not surprisingly the first color we see is FF0000, which we know represents red. The
| second color returned, FFFF00, must represent yellow, a combination of full
| intensity red and full intensity green. This makes sense, since yellow falls between
| red and green on the color wheel as we see here. (We borrowed this image from

...

|===================================== | 48%
| Let's now call p2 with the argument 10. This will show us how the two extremes, red
| and yellow, are blended together.

p2(10)
[1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100" "#FF8D00" "#FFAA00" "#FFC600"
[9] "#FFE200" "#FFFF00"

|====================================== | 49%
| So we see the 10-long vector. For each element, the red component is fixed at FF,
| and the green component grows from 00 (at the first element) to FF (at the last).

...
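
Decoding the palette makes the pattern easier to see than reading hex strings. A minimal sketch, assuming the same p2 palette as above; channels is an illustrative name.

```r
# Rebuild the red-to-yellow palette and decode its 10 colors.
p2 <- colorRampPalette(c("red", "yellow"))
channels <- col2rgb(p2(10))

# Rows are red, green, blue; columns are the 10 colors in order.
print(channels)

# Red stays at its maximum, blue stays at 0, green never decreases.
print(all(channels["red", ] == 255))
print(all(channels["blue", ] == 0))
print(!is.unsorted(channels["green", ]))
```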

|======================================= | 51%
| This is all fine and dandy but you're probably wondering when you can see how all
| these colors show up in a display. We copied some code from the R documentation
| pages (color.scale if you're interested) and created a function for you, showMe.
| This takes as an argument, a color vector, which as you know, is precisely what
| calls to p1 and p2 return to you. Call showMe now with p1(20).

showMe(p1(20))

| That's the answer I was looking for.

|======================================== | 52%
| We see the interpolated palette here. Low values in the lower left corner are red
| and as you move to the upper right, the colors move toward blue. Now call showMe
| with p2(20) as its argument.

showMe(p2(20))

| You're the best!

|========================================= | 54%
| Here we see a similar display, the colors moving from red to yellow, the base colors
| of our p2 palette. For fun, see what p2(2) looks like using showMe.

showMe(p2(2))

| You are really on a roll!

|========================================== | 55%
| A much more basic pattern, simple but elegant.

...

|============================================ | 57%
| We mentioned before that colorRamp (and colorRampPalette) could return a 3 or 4 long
| vector of colors. We saw 3-long vectors returned indicating red, green, and blue
| intensities. What would the 4th entry be?

...

|============================================= | 58%
| We'll answer this indirectly. First, look at the function p1 that colorRampPalette
| returned to you. Just type p1 at the command prompt.

p1
function (n)
{
    x <- ramp(seq.int(0, 1, length.out = n))
    if (ncol(x) == 4L)
        rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255)
    else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255)
}
<bytecode: 0x00000174e0c71940>
<environment: 0x00000174dbdd5a00>

| Keep up the great work!

|============================================== | 59%
| We see that p1 is a short function with one argument, n. The argument n is used as
| the length in a call to the function seq.int, itself an argument to the function
| ramp. We can infer that ramp is just going to divide the interval from 0 to 1 into n
| pieces.

...

|=============================================== | 61%
| The heart of p1 is really the call to the function rgb with either 4 or 5 arguments.
| Use the ?fun construct to look at the R documentation for rgb now.

?rgb

| You got it!

|================================================ | 62%
| We see that rgb is a color specification function that can be used to produce any
| color with red, green, blue proportions. We see the maxColorValue is 1 by default,
| so if we called rgb with values for red, green and blue, we would specify numbers at
| most 1 (assuming we didn't change the default for maxColorValue). According to the
| documentation, what is the maximum number of arguments rgb can have?

1: 6
2: 4
3: 5
4: 3

Selection: 1

| All that practice is paying off!
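
Note: a short sketch of how maxColorValue changes the scale of the inputs to rgb, and of the six formal arguments counted in the question above. The variable names are illustrative only.

```r
# With the default maxColorValue = 1, intensities are proportions in [0, 1]
red_unit <- rgb(1, 0, 0)

# With maxColorValue = 255, the same color is written with 0-255 intensities
red_byte <- rgb(255, 0, 0, maxColorValue = 255)

# A fourth positional value is alpha; the result gains two extra hex digits
teal_opaque <- rgb(0, 0.5, 0.5)
teal_faint  <- rgb(0, 0.5, 0.5, 0.3)

print(c(red_unit, red_byte, teal_opaque, teal_faint))
print(names(formals(rgb)))  # the formal arguments of rgb
```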

|================================================= | 64%
| So the fourth argument is alpha which can be a logical, i.e., either TRUE or FALSE,
| or a numerical value. Create the function p3 now by calling colorRampPalette with
| the colors blue and green (remember to concatenate them into a single argument) and
| setting the alpha argument to .5.

p3<-colorRampPalette(c("blue","green"),alpha=0.5)

| You are really on a roll!

|================================================== | 65%
| Now call p3 with the argument 5.

p3(5)
[1] "#0000FFFF" "#003FBFFF" "#007F7FFF" "#00BF3FFF" "#00FF00FF"

|=================================================== | 67%
| We see that in the 5-long vector that the call returned, each element has 32 bits, 4
| groups of 8 bits each. The last 8 bits represent the value of alpha. Since it was
| NOT ZERO in the call to colorRampPalette, it gets the maximum FF value. (The same
| result would happen if alpha had been set to TRUE.) When it was 0 or FALSE (as in
| previous calls to colorRampPalette) it was given the value 00 and wasn't shown. The
| leftmost 24 bits of each element are the same RGB encoding we previously saw.

...
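
Note: the effect of the alpha argument on the palette output can be checked directly. This is a minimal sketch using alpha = TRUE, which the lesson text says behaves like the nonzero value used above; the variable names are made up.

```r
without_alpha <- colorRampPalette(c("blue", "green"))(5)
with_alpha    <- colorRampPalette(c("blue", "green"), alpha = TRUE)(5)

print(without_alpha)  # 6 hex digits per color: RGB only
print(with_alpha)     # 8 hex digits per color: RGB followed by alpha

# The last two hex digits hold the alpha channel
alpha_digits <- substr(with_alpha, 8, 9)
print(alpha_digits)
```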

|==================================================== | 68%
| So what is alpha? Alpha represents an opacity level, that is, how transparent should
| the colors be. We can add color transparency with the alpha parameter to calls to
| rgb. We haven't seen any examples of this yet, but we will now.

...

|====================================================== | 70%
| We generated 1000 random normal pairs for you in the variables x and y. We'll plot
| them in a scatterplot by calling plot with 4 arguments. The variables x and y are
| the first 2. The third is the print character argument pch. Set this equal to 19
| (filled circles). The final argument is col which should be set equal to a call to
| rgb. Give rgb 3 arguments, 0, .5, and .5.

plot(x,y,pch=19,col=rgb(0,0.5,0.5))

|======================================================= | 71%
| Well this picture is okay for a scatterplot, a nice mix of blue and green, but it
| really doesn't tell us too much information in the center portion, since the points
| are so thick there. We see there are a lot of points, but is one area more filled
| than another? We can't really discriminate between different point densities. This
| is where the alpha argument can help us. Recall your plot command (use the up arrow)
| and add a 4th argument, .3, to the call to rgb. This will be our value for alpha.

plot(x,y,pch=19,col=rgb(0,0.5,0.5,0.3))

| You are amazing!

|======================================================== | 72%
| Clearly this is better. It shows us where, specifically, the densest areas of the
| scatterplot really are.

...
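
Note: swirl supplied x and y in the session, so the plot above cannot be rerun from this log alone. The sketch below generates stand-in data (the seed and the use of rnorm are assumptions) and draws the opaque and semi-transparent versions side by side.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(1000)                 # stand-in for swirl's x
y <- rnorm(1000)                 # stand-in for swirl's y

opaque <- rgb(0, 0.5, 0.5)       # no alpha: points hide each other
faint  <- rgb(0, 0.5, 0.5, 0.3)  # alpha = 0.3: overlaps build up darker color

old_par <- par(mfrow = c(1, 2))  # two plots on one device
plot(x, y, pch = 19, col = opaque, main = "Opaque")
plot(x, y, pch = 19, col = faint,  main = "alpha = 0.3")
par(old_par)
```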

|========================================================= | 74%
| Our last topic for this lesson is the RColorBrewer Package, available on CRAN, that
| contains interesting and useful color palettes, of which there are 3 types,
| sequential, divergent, and qualitative. Which one you would choose to use depends on

...

|========================================================== | 75%
| Here's a picture of the palettes available from this package. The top section shows
| the sequential palettes in which the colors are ordered from light to dark. The
| divergent palettes are at the bottom. Here the neutral color (white) is in the
| center, and as you move from the middle to the two ends of each palette, the colors
| increase in intensity. The middle display shows the qualitative palettes which look
| like collections of random colors. These might be used to distinguish factors in

...

|=========================================================== | 77%
| These colorBrewer palettes can be used in conjunction with the colorRamp() and
| colorRampPalette() functions. You would use colors from a colorBrewer palette as
| your base palette, i.e., as arguments to colorRamp or colorRampPalette which would
| interpolate them to create new colors.

...

|============================================================ | 78%
| As an example of this, create a new object, cols by calling the function brewer.pal
| with 2 arguments, 3 and "BuGn". The string "BuGn" is the second last palette in the
| sequential display. The 3 tells the function how many different colors we want.

cols<-brewer.pal(3,"BuGn")

| That's the answer I was looking for.

|============================================================= | 80%
| Use showMe to look at cols now.

showMe(cols)

| Keep up the great work!

|============================================================== | 81%
| We see 3 colors, mixes of blue and green. Now create the variable pal by calling
| colorRampPalette with cols as its argument.

pal<-colorRampPalette(cols)

| All that hard work is paying off!

|================================================================ | 83%
| The call showMe(pal(3)) would be identical to the showMe(cols) call. So use showMe
| to look at pal(20).

showMe(pal(20))

| Keep up the great work!
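
Note: a sketch of the relationship between cols and pal. It uses RColorBrewer when the package is installed; the hard-coded fallback values are an assumption added so the example runs without it.

```r
cols <- if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  RColorBrewer::brewer.pal(3, "BuGn")
} else {
  # Assumed fallback: the three BuGn colors written out by hand
  c("#E5F5F9", "#99D8C9", "#2CA25F")
}

pal <- colorRampPalette(cols)

print(cols)
print(pal(3))   # asking for 3 colors lands on the 3 base colors
print(pal(20))  # 20 colors: the base colors plus interpolated steps
```

The first and last entries of pal(20) are the lightest and darkest base colors; the other 18 are filled in between them.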

|================================================================= | 84%
| Now we can use the colors in pal(20) to display topographic information on
| Auckland's Maunga Whau Volcano. R provides this information in a matrix called
| volcano which is included in the package datasets. Call the R function image with
| volcano as its first argument and col set equal to pal(20) as its second.

image(volcano,col=pal(20))

| Great job!

|================================================================== | 86%
| We see that the colors here of the sequential palette clue us in on the topography.
| The darker colors are more concentrated than the lighter ones. Just for fun, recall
| your last command calling image and instead of pal(20), use p1(20) as the second
| argument.

image(volcano,col=p1(20))

|=================================================================== | 87%
| Not as nice a picture since the palette isn't as well suited to this data, but
| that's okay. It's review time!!!!

...

|==================================================================== | 88%
| True or False? Careful use of colors in plots/maps/etc. can make it easier for the
| reader to understand what points you're trying to convey.

1: False
2: True

Selection: 2

| You got it!

|===================================================================== | 90%
| Which of the following is an R package that provides color palettes for sequential,
| categorical, and diverging data?

1: RColorBluer
2: RColorBrewer
3: RColorStewer
4: RColorVintner

Selection: 2

| Keep working like that and you'll get there!

|====================================================================== | 91%
| True or False? The colorRamp and colorRampPalette functions can be used in
| conjunction with color palettes to connect data to colors.

1: False
2: True

Selection: 2

| You are really on a roll!

|======================================================================= | 93%
| True or False? Transparency can NEVER be used to clarify plots with many points

1: True
2: False

Selection: 2

| Excellent work!

|========================================================================= | 94%
| True or False? The call p7 <- colorRamp("red","blue") would work (i.e., not
| generate an error).

1: True
2: False

Selection: 2

| Excellent job!

|========================================================================== | 96%
| True or False? The function colors returns only 10 colors.

1: False
2: True

Selection: 1

| All that practice is paying off!

|=========================================================================== | 97%
| Transparency is determined by which parameter of the rgb function?

1: beta
2: gamma
3: it's all Greek to me
4: delta
5: alpha

Selection: 5

| You got it right!

|============================================================================ | 99%
| Congratulations! We hope this lesson didn't make you see red. We're green with envy
| that you blue through it.

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 1
What is your assignment token? xXxXxxXXxXxxXXXx

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-05-08 19:05:14.923558 IST

# Lattice Plotting System

library(ggplot2)
library(lattice)
library(swirl)

| Hi! Type swirl() when you are ready to begin.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

1: Principles of Analytic Graphs 2: Exploratory Graphs
3: Graphics Devices in R 4: Plotting Systems
5: Base Plotting System 6: Lattice Plotting System
7: Working with Colors 8: GGPlot2 Part1
9: GGPlot2 Part2 10: GGPlot2 Extras
11: Hierarchical Clustering 12: K Means Clustering
13: Dimension Reduction 14: Clustering Example
15: CaseStudy

Selection: 6

| Attempting to load lesson dependencies...

| | 0%

| Lattice_Plotting_System. (Slides for this and other Data Science courses may be
| found at github https://github.com/DataScienceSpecialization/courses/. If you care
| to use them, they must be downloaded as a zip file and viewed locally. This lesson
| corresponds to 04_ExploratoryAnalysis/PlottingLattice.)

...

|= | 1%
| In another lesson, we gave you an overview of the three plotting systems in R. In
| this lesson we'll focus on the lattice plotting system. As we did with the base
| plotting system, we'll focus on using lattice to create graphics on the screen
| device rather than another graphics device.

...

|== | 3%
| The lattice plotting system is completely separate and independent of the base
| plotting system. It's an add-on package so it has to be explicitly loaded with a
| call to the R function library. We've done this for you. The R Documentation tells
| us that lattice "is an implementation of Trellis graphics for R. It is a powerful
| and elegant high-level data visualization system with an emphasis on multivariate
| data."

...

|=== | 4%
| Lattice is implemented using two packages. The first is called, not surprisingly,
| lattice, and it contains code for producing Trellis graphics. Some of the functions
| in this package are the higher level functions which you, the user, would call.
| These include xyplot, bwplot, and levelplot.

...

|===== | 6%
| If xyplot produces a scatterplot, what kind of plot does bwplot produce?

1: box and whisker
2: big and whittle
4: black and white

Selection: 1

| That's a job well done!

|====== | 7%
| The second package in the lattice system is grid which contains the low-level
| functions upon which the lattice package is built. You, the user, seldom call
| functions from the grid package directly.

...

|======= | 9%
| Unlike base plotting, the lattice system does not have a "two-phase" aspect with
| separate plotting and annotation. Instead all plotting and annotation is done at
| once with a single function call.

...

|======== | 10%
| The lattice system, as the base does, provides several different plotting functions.
| These include xyplot for creating scatterplots, bwplot for box-and-whiskers plots or
| boxplots, and histogram for histograms. There are several others (stripplot,
| dotplot, splom and levelplot), which we won't cover here.

...

|========= | 12%
| Lattice functions generally take a formula for their first argument, usually of the
| form y ~ x. This indicates that y depends on x, so in a scatterplot y would be
| plotted on the y-axis and x on the x-axis.

...

|========== | 13%
| Here's an example of typical lattice plot call, xyplot(y ~ x | f * g, data). The f
| and g represent the optional conditioning variables. The * represents interaction
| between them. Remember when we said that lattice is good for plotting multivariate
| data? That's where these conditioning variables come into play.

...
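
Note: a small sketch of the y ~ x | f * g form. The mtcars data set and the choice of cyl and am as conditioning variables are assumptions picked for illustration; they are not part of the lesson.

```r
library(lattice)

# mpg against wt, conditioned on every combination of cyl and am
p_cond <- xyplot(mpg ~ wt | factor(cyl) * factor(am), data = mtcars)

# One panel per combination of conditioning levels
print(p_cond$condlevels)
n_panels <- prod(lengths(p_cond$condlevels))
print(n_panels)

print(p_cond)  # draw the plot
```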

|=========== | 15%
| The second argument is the data frame or list from which the variables in the
| formula should be looked up. If no data frame or list is passed, then the parent
| frame is used. If no other arguments are passed, the default values are used.

...

|============= | 16%
| Recall the airquality data we've used before. We've loaded it again for you. To
| remind yourself what it looks like run the R command head with airquality as an
| argument to see what the data looks like.

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

| You got it right!

|============== | 18%
| Now try running xyplot with the formula Ozone~Wind as the first argument and the
| second argument data set equal to airquality.

xyplot(Ozone~Wind,data=airquality)

| That's a job well done!

|=============== | 19%
| Look vaguely familiar? The dots are blue, instead of black, but lattice labeled the
| axes for you. You can use some of the same graphical parameters (e.g., pch and col)
| that you used in the base package in calls to lattice functions.

...

|================ | 21%
| Now rerun xyplot with the formula Ozone~Wind as the first argument and the second
| argument data set equal to airquality (use the up arrow to save typing). This time
| add the arguments col set equal to "red", pch set equal to 8, and main set equal to
| "Big Apple Data".

xyplot(Ozone ~ Wind, data = airquality, pch=8, col="red", main="Big Apple Data")

| You are really on a roll!

|================= | 22%
| Red snowflakes are cool, right? Now that you’ve seen the basic xyplot() and some of
| its arguments, you might want to experiment more by yourself when you're done with
| the lesson to discover what other arguments and colors are available. (If you can't
| wait to experiment, recall that swirl has play() and nxt() functions. At a command
| prompt, typing play() allows you to leave swirl temporarily so you can try different
| R commands at the console. Typing nxt() when you’re done playing brings you back to
| swirl and you can resume your lesson.)

...

|================== | 24%
| Now you'll see how easy it is to generate a multipanel plot using a single lattice
| command.

...

|==================== | 25%
| Run xyplot with the formula Ozone~Wind | as.factor(Month) as the first argument and
| the second argument data set equal to airquality (use the up arrow to save typing).
| So far, not much is different, right? Add a third argument, layout, set equal to
| c(5,1).

xyplot(Ozone~Wind|as.factor(Month),data=airquality,layout=c(5,1))

| Great job!

|===================== | 27%
| Note that the default color and plotting character are back. What did the
| as.factor(Month) do?

1: Randomly divided the data into 5 panels
2: Huh?
3: Displayed and labeled each subplot with the month's integer
4: Displayed the data by individual months

Selection: 3

|====================== | 28%
| Since Month is a named column of the airquality dataframe we had to tell R to treat
| it as a factor. To see how this affects the plot, rerun the xyplot command you just
| ran, but use Ozone ~ Wind | Month instead of Ozone ~ Wind | as.factor(Month) as the
| first argument.

xyplot(Ozone~Wind|Month,data=airquality,layout=c(5,1))

| Keep working like that and you'll get there!

|======================= | 30%
| Not as informative, right? The word Month in each panel really doesn't tell you much
| if it doesn't identify which month it's plotting. Notice that the actual data is the
| same between the two plots, though.

...
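
Note: the difference between the two calls comes from the type of the conditioning variable. A minimal sketch, with illustrative variable names:

```r
library(lattice)

# Month is stored as a number; as.factor turns it into 5 discrete levels
print(class(airquality$Month))
month_f <- as.factor(airquality$Month)
print(levels(month_f))

# Factor conditioning: each panel strip carries its level label
p_factor  <- xyplot(Ozone ~ Wind | as.factor(Month), data = airquality,
                    layout = c(5, 1))
# Numeric conditioning: same data, less informative strips
p_numeric <- xyplot(Ozone ~ Wind | Month, data = airquality,
                    layout = c(5, 1))

print(p_factor)
print(p_numeric)
```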

|======================== | 31%
| Lattice functions behave differently from base graphics functions in one critical
| way. Recall that base graphics functions plot data directly to the graphics device
| (e.g., screen, or file such as a PDF file). In contrast, lattice graphics functions
| return an object of class trellis.

...

|========================= | 33%
| The print methods for lattice functions actually do the work of plotting the data on
| the graphics device. They return "plot objects" that can be stored (but it’s usually
| better to just save the code and data). On the command line, trellis objects are
| auto-printed so that it appears the function is plotting the data.

...
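
Note: auto-printing only happens at the top level of the console. Inside a function (or a loop) the trellis object has to be printed explicitly, as in this sketch; draw_ozone is a hypothetical helper name.

```r
library(lattice)

draw_ozone <- function() {
  p <- xyplot(Ozone ~ Wind, data = airquality)
  print(p)      # without this call nothing would be drawn from inside the function
  invisible(p)  # hand the object back without auto-printing it again
}

p_obj <- draw_ozone()
print(class(p_obj))
```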

|========================== | 34%
| To see this, create a variable p which is assigned the output of this simple call to
| xyplot, xyplot(Ozone~Wind,data=airquality).

p<-xyplot(Ozone~Wind,data=airquality)

| You are amazing!

|============================ | 36%
| Nothing plotted, right? But the object p is around.

...

|============================= | 37%
| Type p or print(p) now to see it.

p

| Excellent job!

|============================== | 39%
| Like magic, it appears. Now run the R command names with p as its argument.

names(p)
[1] "formula" "as.table" "aspect.fill" "legend"
[5] "panel" "page" "layout" "skip"
[9] "strip" "strip.left" "xscale.components" "yscale.components"
[13] "axis" "xlab" "ylab" "xlab.default"
[17] "ylab.default" "xlab.top" "ylab.right" "main"
[21] "sub" "x.between" "y.between" "par.settings"
[25] "plot.args" "lattice.options" "par.strip.text" "index.cond"
[29] "perm.cond" "condlevels" "call" "x.scales"
[33] "y.scales" "panel.args.common" "panel.args" "packet.sizes"
[37] "x.limits" "y.limits" "x.used.at" "y.used.at"
[41] "x.num.limit" "y.num.limit" "aspect.ratio" "prepanel.default"
[45] "prepanel"

| All that practice is paying off!

|=============================== | 40%
| We see that the trellis object p has 45 named properties, the first of which is
| "formula" which isn't too surprising. A lot of these properties are probably NULL in
| value. We've done some behind-the-scenes work for you and created two vectors. The
| first, mynames, is a character vector of the names in p. The second is a boolean
| vector, myfull, which has TRUE values for nonnull entries of p. Run mynames[myfull]
| to see which entries of p are not NULL.

mynames[myfull]
[1] "formula" "as.table" "aspect.fill" "panel"
[5] "skip" "strip" "strip.left" "xscale.components"
[9] "yscale.components" "axis" "xlab" "ylab"
[13] "xlab.default" "ylab.default" "x.between" "y.between"
[17] "index.cond" "perm.cond" "condlevels" "call"
[21] "x.scales" "y.scales" "panel.args.common" "panel.args"
[25] "packet.sizes" "x.limits" "y.limits" "aspect.ratio"
[29] "prepanel.default"

| That's the answer I was looking for.
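
Note: swirl did not show how mynames and myfull were built. One way to get equivalent vectors is sketched below; this is a guess at the construction, not swirl's actual code.

```r
library(lattice)

p <- xyplot(Ozone ~ Wind, data = airquality)

mynames <- names(p)                              # names of all components of p
myfull  <- !vapply(unclass(p), is.null, logical(1))  # TRUE where a component is set

print(length(mynames))
print(mynames[myfull])
```

The exact counts may differ from the ones in this log if the installed lattice version differs.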

|================================ | 42%
| Wow! 29 nonNull values for one little plot. Note that a lot of them are like the
| ones we saw in the base plotting system. Let's look at the values of some of them.
| Type p[["formula"]] now.

p[["formula"]]
Ozone ~ Wind

| You are amazing!

|================================= | 43%
| Not surprising, is it? It's a familiar formula. Now look at p's x.limits. Remember
| the double square brackets and quotes.

p[["x.limits"]]
[1] 0.37 22.03

| Keep up the great work!

|================================== | 45%
| They match the plot, right? The x values are indeed between .37 and 22.03.

...

|==================================== | 46%
| Again, not surprising. Before we wrap up, let's talk about lattice's panel functions
| which control what happens inside each panel of the plot. The ease of making
| multi-panel plots makes lattice very appealing. The lattice package comes with
| default panel functions, but you can customize what happens in each panel.

...

|===================================== | 48%
| Panel functions receive the x and y coordinates of the data points in their panel
| (along with any optional arguments). To see this, we've created some data for you -
| two 100-long vectors, x and y. For its first 50 values y is a function of x, for the
| last 50 values, y is random. We've also defined a 100-long factor vector f which
| distinguishes between the first and last 50 elements of the two vectors. Run the R
| command table with f as its argument.

table(f)

f
Group 1 Group 2
     50      50

| That's a job well done!

|====================================== | 49%
| The first 50 entries of f are "Group 1" and the last 50 are "Group 2". Run xyplot
| with two arguments. The first is the formula y~x|f, and the second is layout set
| equal to c(2,1). Note that we're not providing an explicit data argument, so xyplot
| will look in the environment and see the x and y that we've generated for you.

xyplot(y~x|f,layout=c(2,1))

| You nailed it! Good job!

|======================================= | 51%
| To understand this a little better look at the variable v1 we've created for you.

v1
[1] -2.185287 1.101780 -2.716851 1.569850

| You're the best!

|======================================== | 52%
| The first two numbers are the range of the x values of Group 1 and the last two
| numbers are the range of y values of Group 1. See how they match the values of the
| left panel (Group 1) in the plot. Now look at v2 which holds the comparable numbers
| for Group 2.

v2
[1] -1.6066772 2.2205197 -0.1605085 2.0341048

| You nailed it! Good job!
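
Note: v1 and v2 were prepared by swirl, and neither their construction nor the underlying x, y and f is shown. The sketch below builds stand-in data matching the description (first 50 values of y depend on x, last 50 are random) and computes per-group ranges in the same layout; all names and the seed are assumptions.

```r
set.seed(42)                                        # arbitrary seed
x <- rnorm(100)
y <- c(2 * x[1:50] + rnorm(50, sd = 0.3),           # Group 1: y depends on x
       rnorm(50))                                   # Group 2: y is random
f <- factor(rep(c("Group 1", "Group 2"), each = 50))

in_g1 <- f == "Group 1"
v1_sketch <- c(range(x[in_g1]),  range(y[in_g1]))   # x range, then y range
v2_sketch <- c(range(x[!in_g1]), range(y[!in_g1]))

print(v1_sketch)
print(v2_sketch)
```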

|========================================= | 54%
| Again, the values match the plot. That's reassuring. We've copied some code from the
| slides for you. To see it, type myedit("plot1.R"). This will open your editor and
| display the R code in it.

myedit("plot1.R")

p <- xyplot(y ~ x | f, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)  ## First call the default panel function for 'xyplot'
    panel.abline(h = median(y), lty = 2)  ## Add a horizontal line at the median
})
print(p)
invisible()

| You are quite good my friend!

|=========================================== | 55%
| How many calls to basic lattice plotting functions are there in plot1.R?

1: 1
2: 2
3: 3

Selection: 1

| You got it!

|============================================ | 57%
| Note the panel function. How many formal arguments does it have?

1: 2
2: 3
3: 1

Selection: 1

| One more time. You can do it!

| You have to count the ... as an argument?

1: 1
2: 3
3: 2

Selection: 2

| Excellent job!

|============================================= | 58%
| The panel function has 3 arguments, x, y and ... . This last stands for all other
| arguments (such as graphical parameters) you might want to include. There are 2
| lines in the panel function. Each invokes a panel method, the first to plot the data
| in each panel (panel.xyplot), the second to draw a horizontal line in each panel
| (panel.abline). Note the similarity of this last call to that of the base plotting
| function of the same name.

...
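
Note: a self-contained sketch of a custom panel function. The data are stand-ins generated to match the lesson's description, since swirl's x, y and f are not available here; the seed and coefficients are assumptions.

```r
library(lattice)

set.seed(10)                                         # arbitrary seed
x <- rnorm(100)
y <- c(x[1:50] + rnorm(50, sd = 0.5), rnorm(50))     # first half tied to x
f <- factor(rep(c("Group 1", "Group 2"), each = 50))

# The panel function runs once per panel and only sees that panel's x and y
p_custom <- xyplot(y ~ x | f, layout = c(2, 1),
                   panel = function(x, y, ...) {
                     panel.xyplot(x, y, ...)               # default points
                     panel.abline(h = median(y), lty = 2)  # per-panel median
                   })
print(p_custom)
```

Because median(y) is evaluated inside the panel function, each panel gets its own dashed line rather than one line for the pooled data.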

|============================================== | 60%
| We've defined a function for you, pathtofile, which takes a filename as its
| argument. This makes sure R can find the file on your computer. Now run the R
| command source with two arguments. The first is the call to pathtofile with the
| string "plot1.R" as its argument and the second is the argument local set equal to
| TRUE. This command will run the code contained in plot1.R within the swirl
| environment so you can see what it does.

source(pathtofile("plot1.R"),local=TRUE)

| That's the answer I was looking for.

|=============================================== | 61%
| See how the lines appear. The plot shows two panels because...?

1: there are 2 calls to panel methods
2: f contains 2 factors
3: there are 2 variables
4: lattice can handle at most 2 panels

Selection: 2

| All that hard work is paying off!

|================================================ | 63%
| We've copied another piece of similar code, i.e., a call to xyplot with a custom
| panel function, from the slides. To see it, type myedit("plot2.R"). This will open
| your editor and display the R code in it.

myedit("plot2.R")

p2 <- xyplot(y ~ x | f, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)  ## First call default panel function
    panel.lmline(x, y, col = 2)  ## Overlay a simple linear regression line
})
print(p2)
invisible()

| You nailed it! Good job!

|================================================= | 64%
| You can see how plot2.R differs from plot1.R, right?

...

|=================================================== | 66%
| Again, run the R command source with the two arguments pathtofile("plot2.R") and
| local=TRUE. This will run the code in plot2.R.

source(pathtofile("plot2.R"),local=TRUE)

| You are doing so well!

|==================================================== | 67%
| The regression lines are red because ...?

1: R always plots regression lines in red
2: R is the first letter of the word red
3: the custom panel function specified a col argument

Selection: 3

| Excellent job!

|===================================================== | 69%
| Before we close we'll look at how easily lattice can handle a plot with a great many
| panels. (The sky's the limit.) We've loaded some diamond data for you. It comes with
| the ggplot2 package. We'll use it just to show off lattice's panel plotting
| capability.

...

|====================================================== | 70%
| The data is in the data frame diamonds. Use the R command str to see what it looks
| like.

str(diamonds)

tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

|======================================================= | 72%
| So the data frame contains 10 pieces of information for each of 53940 diamonds. Run
| the R command table with diamonds$color as an argument.

table(diamonds$color)

    D     E     F     G     H     I     J
 6775  9797  9542 11292  8304  5422  2808

| You nailed it! Good job!

|======================================================== | 73%
| We see 7 colors each represented by a letter. Now run the R command table with two
| arguments, diamonds$color and diamonds$cut.

table(diamonds$color,diamonds$cut)

    Fair Good Very Good Premium Ideal
  D  163  662      1513    1603  2834
  E  224  933      2400    2337  3903
  F  312  909      2164    2331  3826
  G  314  871      2299    2924  4884
  H  303  702      1824    2360  3115
  I  175  522      1204    1428  2093
  J  119  307       678     808   896

| That's a job well done!

|========================================================= | 75%
| We see a 7 by 5 array with counts indicating how many diamonds in the data frame
| have a particular color and cut. From the table, which is the most frequent
| combination?

1: Ideal cut of color F.
2: Ideal color of cut G
3: Premium cut of color G
4: Ideal cut of color G

Selection: 4

| Keep up the great work!
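
Note: the most frequent combination can also be found programmatically. The sketch below retypes the counts from the table above into a matrix so that it runs without ggplot2; with the real data, table(diamonds$color, diamonds$cut) could be used in place of counts.

```r
# Counts copied from the color-by-cut table printed above
counts <- matrix(
  c(163, 662, 1513, 1603, 2834,
    224, 933, 2400, 2337, 3903,
    312, 909, 2164, 2331, 3826,
    314, 871, 2299, 2924, 4884,
    303, 702, 1824, 2360, 3115,
    175, 522, 1204, 1428, 2093,
    119, 307,  678,  808,  896),
  nrow = 7, byrow = TRUE,
  dimnames = list(c("D", "E", "F", "G", "H", "I", "J"),
                  c("Fair", "Good", "Very Good", "Premium", "Ideal"))
)

# Row and column position of the largest cell
idx <- which(counts == max(counts), arr.ind = TRUE)
top_color <- rownames(counts)[idx[1, 1]]
top_cut   <- colnames(counts)[idx[1, 2]]
print(c(color = top_color, cut = top_cut, n = max(counts)))
```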

|=========================================================== | 76%
| To save you some trouble we've defined three character strings for you, labels for
| the x- and y-axes and a main title. They're in the file myLabels.R, so run myedit on
| this file to see them. Remember to put the file name in quotes when you call myedit.

myedit("myLabels.R")

myxlab <- "Carat"
myylab <- "Price"
mymain <- "Diamonds are Sparkly!"

| Excellent job!

|============================================================ | 78%
| Now run source with pathtofile("myLabels.R") and local set equal to TRUE.

source(pathtofile("myLabels.R"),local=TRUE)

| All that hard work is paying off!

|============================================================= | 79%
| Now call xyplot with the formula price~carat | color*cut and data set equal to
| diamonds. In addition, set the argument strip equal to FALSE, pch set equal to 20,
| xlab to myxlab, ylab to myylab, and main to mymain. The plot may take longer than
| previous plots because it is bigger.

xyplot(price~carat|color*cut,data=diamonds,strip=FALSE,pch=20,xlab=myxlab,ylab=myylab,main=mymain)

| Excellent work!

|============================================================== | 81%
| Pretty cool, right? 35 panels, one for each combination of color and cut. The dots
| (pch=20) show how prices for the diamonds in each category (panel) vary depending on
| carat.

...
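
Note: the panel count is the product of the numbers of levels of the conditioning variables (7 colors times 5 cuts gives 35 above). The sketch below shows the same rule on the built-in CO2 data set, chosen here as an assumption so the example runs without ggplot2, and draws the plot with and without strips.

```r
library(lattice)

# Two conditioning factors with 2 levels each
p_strips <- xyplot(uptake ~ conc | Type * Treatment, data = CO2, pch = 20)
p_bare   <- xyplot(uptake ~ conc | Type * Treatment, data = CO2, pch = 20,
                   strip = FALSE)

n_panels <- prod(lengths(p_strips$condlevels))
print(n_panels)

print(p_strips)  # panels labeled with their Type and Treatment
print(p_bare)    # same panels, labels suppressed
```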

|=============================================================== | 82%
| Are colors defining the rows or columns of the plot?

1: columns
2: rows

Selection: 1

| You got it!

|================================================================ | 84%
| Were you curious about that argument strip? I know I was. Now rerun the xyplot
| command you just ran (use the up arrow key to retrieve it), this time without the
| strip argument.

xyplot(price~carat|color*cut,data=diamonds,pch=20,xlab=myxlab,ylab=myylab,main=mymain)

| All that hard work is paying off!

|================================================================== | 85%
| The plot shows that the strip argument ....

1: labels each panel
2: removes information from the plot
3: makes the plot less intelligible
4: has a default value of FALSE

Selection: 1

| All that practice is paying off!

|=================================================================== | 87%
| Review time!!!

...

|==================================================================== | 88%
| True or False? Lattice plots are constructed by a series of calls to core functions.

1: False
2: True

Selection: 1

| That's the answer I was looking for.

|===================================================================== | 90%
| True or False? Lattice plots are constructed with a single function call to a core
| lattice function (e.g. xyplot)

1: False
2: True

Selection: 2

| You got it!

|====================================================================== | 91%
| True or False? Aspects like margins and spacing are automatically handled and
| defaults are usually sufficient.

1: False
2: True

Selection: 2

| That's a job well done!

|======================================================================= | 93%
| True or False? The lattice system is ideal for creating conditioning plots where you
| examine the same kind of plot under many different conditions.

1: False
2: True

Selection: 2

| Excellent job!

|======================================================================== | 94%
| True or False? The lattice system, like the base plotting system, returns a trellis
| plot object.

1: False
2: True

Selection: 1

| Nice work!
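
To make that difference concrete, here is a small sketch using the built-in cars data (chosen for this example only). A lattice call builds an object that is drawn when printed, while a base call draws as a side effect.

```r
library(lattice)

# a lattice call returns an object; nothing is drawn until it is printed
p <- xyplot(dist ~ speed, data = cars)
class(p)

# drawing happens here (auto-printing does this for you at the console)
print(p)

# a base scatterplot draws immediately and returns NULL invisibly
r <- plot(cars$speed, cars$dist)
is.null(r)
```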

|========================================================================== | 96%
| True or False? Panel functions can NEVER be customized to modify what is plotted in
| each of the plot panels.

1: True
2: False

Selection: 2

|=========================================================================== | 97%
| True or False? Lattice plots can display at most 20 panels in a single plot.

1: False
2: True

Selection: 1

| You are doing so well!

|============================================================================ | 99%
| Congrats! We hope this lesson didn't leave you climbing the trellis.

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: No
2: Yes

Selection: 2
What is your assignment token? xXxXxxXXxXxxXXXx

| Keep up the great work!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-05-08 12:20:33.102850 IST

# Base Plotting System

library(swirl)
swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you did
| then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

1: Principles of Analytic Graphs 2: Exploratory Graphs
3: Graphics Devices in R 4: Plotting Systems
5: Base Plotting System 6: Lattice Plotting System
7: Working with Colors 8: GGPlot2 Part1
9: GGPlot2 Part2 10: GGPlot2 Extras
11: Hierarchical Clustering 12: K Means Clustering
13: Dimension Reduction 14: Clustering Example
15: CaseStudy

Selection: 5
| | 0%

| Base_Plotting_System. (Slides for this and other Data Science courses may be found at
| github https://github.com/DataScienceSpecialization/courses/. If you care to use them,
| they must be downloaded as a zip file and viewed locally. This lesson corresponds to
| 04_ExploratoryAnalysis/PlottingBase.)

...

|= | 2%
| In another lesson, we gave you an overview of the three plotting systems in R. In this
| lesson we'll focus on the base plotting system and talk more about how you can exploit all
| its many parameters to get the plot you want. We'll focus on using the base plotting
| system to create graphics on the screen device rather than another graphics device.

...

|=== | 3%
| The core plotting and graphics engine in R is encapsulated in two packages. The first is
| the graphics package which contains plotting functions for the "base" system. The
| functions in this package include plot, hist, boxplot, barplot, etc. The second package is
| grDevices which contains all the code implementing the various graphics devices, including
| X11, PDF, PostScript, PNG, etc.

...

|==== | 5%
| Base graphics are often constructed piecemeal, with each aspect of the plot handled
| separately through a particular function call. Usually you start with a plot function
| (such as plot, hist, or boxplot), then you use annotation functions (text, abline, points)
| to add to or modify your plot.

...

|===== | 6%
| Before making a plot you have to determine where the plot will appear and what it will be
| used for. Is there a large amount of data going into the plot? Or is it just a few
| points? Do you need to be able to dynamically resize the graphic?

...

|====== | 8%
| What do you think is a disadvantage of the Base Plotting System?

1: It mirrors how we think of building plots and analyzing data
2: You can't go back once a plot has started
3: It's intuitive and exploratory
4: A complicated plot is a series of simple R commands

Selection: 2

| That's a job well done!

|======== | 9%
| Yes! The base system is very intuitive and easy to use. You can't go backwards, though,
| say, if you need to readjust margins or have misspelled a caption. A finished plot will be
| a series of R commands, so it's difficult to translate a finished plot into a different
| system.

...

|========= | 11%
| Calling a basic routine such as plot(x, y) or hist(x) launches a graphics device (if one
| is not already open) and draws a new plot on the device. If the arguments to plot or hist
| are not of some special class, then the default method is called.

...
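
A short sketch of what "special class" means in practice. The objects below are illustrative; the point is only that plot picks a method based on the class of its first argument.

```r
# a plain numeric vector has no special class, so plot() falls through to plot.default
x <- c(2, 4, 8, 16)
class(x)
plot(x)

# a data frame has its own method, which draws a scatterplot matrix
class(airquality)
plot(airquality[, c("Ozone", "Wind", "Temp")])

# hist() returns an object of class "histogram", which has its own plot method too
h <- hist(airquality$Ozone, plot = FALSE)
class(h)
plot(h)
```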

|========== | 12%
| As you'll see, most of the base plotting functions have many arguments, for example,
| setting the title, labels of axes, plot character, etc. Some of the parameters can be set
| when you call the function or they can be added later in a separate function call.

...

|=========== | 14%
| Now we'll go through some quick examples of basic plotting before we delve into gory
| details. We'll use the dataset airquality (part of the library datasets) which we've
| loaded for you. This shows ozone and other air measurements for New York City for 5 months
| in 1973.

...

|============= | 15%
| Use the R command head with airquality as an argument to see what the data looks like.

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

|============== | 17%
| We see the dataset contains 6 columns of data. Run the command range with two arguments.
| The first is the ozone column of airquality, specified by airquality$Ozone, and the second
| is the boolean na.rm set equal to TRUE. If you don't specify this second argument, you
| won't get a meaningful result.

range(airquality$Ozone,na.rm=TRUE)
[1] 1 168
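
A quick sketch of why the second argument matters. The variable names are illustrative.

```r
# Ozone contains missing values, so the default call returns NA for both ends
r_default <- range(airquality$Ozone)
r_default

# na.rm = TRUE drops the NAs before taking the minimum and maximum
r_clean <- range(airquality$Ozone, na.rm = TRUE)
r_clean

# number of missing ozone measurements
n_missing <- sum(is.na(airquality$Ozone))
n_missing
```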

|=============== | 18%
| So the measurements range from 1 to 168. First we'll do a simple histogram of this ozone
| column to show the distribution of measurements. Use the R command hist with the argument
| airquality$Ozone.

hist(airquality$Ozone)

| You're the best!

|================ | 20%
| Simple, right? R put a title on the histogram and labeled both axes for you. What is the
| most frequent count?

1: Under 25
2: Over 100
3: Over 150
4: Between 60 and 75

Selection: 1

| Great job!

|================== | 21%
| Next we'll do a boxplot. First, though, run the R command table with the argument
| airquality$Month.

table(airquality$Month)

 5  6  7  8  9
31 30 31 31 30

| Excellent work!

|=================== | 23%
| We see that the data covers 5 months, May through September. We'll want a boxplot of ozone
| as a function of the month in which the measurements were taken so we'll use the R formula
| Ozone~Month as the first argument of boxplot. Our second argument will be airquality, the
| dataset from which the variables of the first argument are taken. Try this now.

boxplot(Ozone~Month,airquality)

| Keep up the great work!

|==================== | 24%
| Note that boxplot, unlike hist, did NOT specify a title and axis labels for you
| automatically.

...

|===================== | 26%
| Let's call boxplot again to specify labels. (Use the up arrow to recover the previous
| command and save yourself some typing.) We'll add more arguments to the call to specify
| labels for the 2 axes. Set xlab equal to "Month" and ylab equal to "Ozone (ppb)". Specify
| col.axis equal to "blue" and col.lab equal to "red". Try this now.

boxplot(Ozone~Month,airquality,xlab="Month",ylab="Ozone (ppb)",col.axis="blue",col.lab="red")

| All that practice is paying off!

|======================= | 27%
| Nice colors, but still no title. Let's add one with the R command title. Use the argument
| main set equal to the string "Ozone and Wind in New York City".

title(main="Ozone and Wind in New York City")

| Nice work!

|======================== | 29%
| Now we'll show you how to plot a simple two-dimensional scatterplot using the R function
| plot. We'll show the relationship between Wind (x-axis) and Ozone (y-axis). We'll use the
| function plot with those two arguments (Wind and Ozone, in that order). To save some
| typing, though, we'll call the R command with using 2 arguments. The first argument of
| with will be airquality, the dataset containing Wind and Ozone; the second argument will
| be the call to plot. Doing this allows us to avoid using the longer notation, e.g.,
| airquality$Wind. Try this now.

with(airquality,plot(Wind,Ozone))

| Perseverance, that's the answer.

|========================= | 30%
| Note that plot generated labels for the x and y axes but no title.

...

|========================== | 32%
| Add one now with the R command title. Use the argument main set equal to the string "Ozone
| and Wind in New York City". (You can use the up arrow to recover the command if you don't
| want to type it.)

title(main="Ozone and Wind in New York City")

| Perseverance, that's the answer.

|============================ | 33%
| The basic plotting parameters are documented in the R help page for the function par. You
| can use par to set parameters OR to find out what values are already set. To see just how
| much flexibility you have, run the R command length with the argument par() now.

length(par())
[1] 72

| All that hard work is paying off!

|============================= | 35%
| So there are a boatload (72) of parameters that par() gives you access to. Run the R
| function names with par() as its argument to see what these parameters are.

names(par())
 [1] "xlog"      "ylog"      "adj"       "ann"       "ask"       "bg"        "bty"
 [8] "cex"       "cex.axis"  "cex.lab"   "cex.main"  "cex.sub"   "cin"       "col"
[15] "col.axis"  "col.lab"   "col.main"  "col.sub"   "cra"       "crt"       "csi"
[22] "cxy"       "din"       "err"       "family"    "fg"        "fig"       "fin"
[29] "font"      "font.axis" "font.lab"  "font.main" "font.sub"  "lab"       "las"
[36] "lend"      "lheight"   "ljoin"     "lmitre"    "lty"       "lwd"       "mai"
[43] "mar"       "mex"       "mfcol"     "mfg"       "mfrow"     "mgp"       "mkh"
[50] "new"       "oma"       "omd"       "omi"       "page"      "pch"       "pin"
[57] "plt"       "ps"        "pty"       "smo"       "srt"       "tck"       "tcl"
[64] "usr"       "xaxp"      "xaxs"      "xaxt"      "xpd"       "yaxp"      "yaxs"
[71] "yaxt"      "ylbias"

| You got it right!

|============================== | 36%
| Variety is the spice of life. You might recognize some of these such as col and lwd from
| previous swirl lessons. You can always run ?par to see what they do. For now, run the
| command par()$pin and see what you get.

par()$pin
[1] 4.520417 1.805833

| You got it!

|=============================== | 38%
| Alternatively, you could have gotten the same result by running par("pin") or par('pin').
| What do you think these two numbers represent?

1: Coordinates of the center of the plot window
2: Random numbers
3: A confidence interval
4: Plot dimensions in inches

Selection: 4

| All that hard work is paying off!

|================================= | 39%
| Now, run the command par("fg") or par('fg') or par()$fg and see what you get.

par("fg")
[1] "black"

| You nailed it! Good job!

|================================== | 41%
| It gave you a color, right? Since par()$fg specifies foreground color, what do you think
| par()$bg specifies?

1: Beautiful color
2: blue-green
3: Better color
4: Background color

Selection: 4

| You are amazing!
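
The three query forms mentioned above can be checked against each other. This is a small sketch with illustrative variable names; the values you get back depend on the graphics device that is open.

```r
# three equivalent ways to read one graphical parameter
fg1 <- par("fg")
fg2 <- par('fg')
fg3 <- par()$fg
identical(fg1, fg2) && identical(fg2, fg3)

# asking for several parameters at once returns a named list
several <- par(c("fg", "bg", "pin"))
names(several)
```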

|=================================== | 42%
| Many base plotting functions share a set of parameters. We'll go through some of the more
| commonly used ones now. See if you can tell what they do from their names.

...

|==================================== | 44%
| What do you think the graphical parameter pch controls?

1: pc help
2: plot character
3: point control height
4: picture characteristics

Selection: 2

| You are doing so well!

|=================================== | 45%
| The plot character default is the open circle, but it "can either be a single
| character or an integer code for one of a set of graphics symbols." Run the command
| par("pch") to see the integer value of the default. When you need to, you can use
| R's Documentation (?pch) to find what the other values mean.

par("pch")
[1] 1

| You're the best!
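
One way to see what the integer codes look like is to draw them. The sketch below lays out codes 0 to 25 on a small grid; the layout and variable names are choices made for this example.

```r
# the numbered plotting symbols, arranged in two rows of 13
codes <- 0:25
gx <- (codes %% 13) + 1      # column position 1..13
gy <- 2 - (codes %/% 13)     # row 2 holds codes 0-12, row 1 holds codes 13-25

plot(gx, gy, pch = codes, cex = 2, ylim = c(0.5, 2.5),
     axes = FALSE, xlab = "", ylab = "", main = "pch codes 0-25")

# print each code just below its symbol
text(gx, gy - 0.3, labels = codes, cex = 0.8)

# a single character also works as a plot symbol
points(7, 1.5, pch = "A")
```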

|==================================== | 47%
| So 1 is the code for the open circle. What do you think the graphical parameters lty
| and lwd control respectively?

1: line length and width
2: line slope and intercept
3: line type and width
4: line width and type

Selection: 3

| You nailed it! Good job!

|===================================== | 48%
| Run the command par("lty") to see the default line type.

par("lty")
[1] "solid"

| Excellent job!

|====================================== | 50%
| So the default line type is solid, but it can be dashed, dotted, etc. Once again,
| R's ?par documentation will tell you what other line types are available. The line
| width is a positive number; the default value is 1.

...
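
A small sketch that draws the numbered line types next to each other. The canvas size and the widths used are arbitrary choices for illustration.

```r
# empty canvas, then one horizontal line per line type code
plot(c(0, 1), c(0.5, 6.5), type = "n", xlab = "", ylab = "lty code",
     main = "Line types 1-6")
for (k in 1:6) {
  # lwd grows with k; fractional widths such as 0.5 are accepted too
  lines(c(0, 1), c(k, k), lty = k, lwd = k / 2)
}

# line types can also be given by name
abline(v = 0.5, lty = "dotted")

# the defaults themselves are unchanged by the calls above
default_lty <- par("lty")
default_lwd <- par("lwd")
```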

|======================================== | 52%
| We've seen a lot of examples of col, the plotting color, specified as a number,
| string, or hex code; the colors() function gives you a vector of colors by name.

...
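
The three ways of naming a color can be compared directly. This is a sketch with illustrative variable names; note that the shade behind a numeric code depends on the current palette(), so it is not guaranteed to match the named color.

```r
# the same red by name and by hex code
col_by_name <- "red"
col_by_hex  <- "#FF0000"

# col2rgb shows the underlying red/green/blue components
col2rgb(col_by_name)
col2rgb(col_by_hex)

# rgb() builds a hex code from components in [0, 1]
hex_from_rgb <- rgb(1, 0, 0)

# a number is an index into the current palette
palette()[2]

# colors() lists every named color
head(colors())
```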

|========================================= | 53%
| What do you think the graphical parameters xlab and ylab control respectively?

1: labels for the y- and x- axes
2: labels for the x- and y- axes

Selection: 2

| You are quite good my friend!

|========================================== | 55%
| The par() function is used to specify global graphics parameters that affect all
| plots in an R session. (Use dev.off or plot.new to reset to the defaults.) These
| parameters can be overridden when specified as arguments to specific plotting
| functions. These include las (the orientation of the axis labels on the plot), bg
| (background color), mar (margin size), oma (outer margin size), mfrow and mfcol
| (number of plots per row, column).

...
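
A common pattern that follows from this, sketched under the assumption that you want to change a global setting only temporarily: par returns the old values of whatever it changes, so they can be saved and restored.

```r
# par() returns the previous values of the parameters it changes
old <- par(las = 1)

# this plot picks up the global setting (axis labels horizontal)
with(airquality, plot(Wind, Ozone))

# an argument in the call overrides the global value for this plot only
with(airquality, plot(Wind, Ozone, las = 2))

# put things back the way they were
par(old)
```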

|=========================================== | 56%
| The last two, mfrow and mfcol, both deal with multiple plots in that they specify
| the number of plots per row and column. The difference between them is the order in
| which they fill the plot matrix. The call mfrow will fill the rows first while mfcol
| fills the columns first.

...
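
The fill order can be checked with par("mfg"), which reports the row and column of the panel that was just drawn. A minimal sketch with placeholder plots:

```r
# mfrow: the 2 x 2 grid is filled row by row
par(mfrow = c(2, 2))
for (i in 1:2) plot(1, main = paste("mfrow plot", i))
pos_row <- par("mfg")[1:2]   # (row, column) of the second plot

# mfcol: same grid, but filled column by column
par(mfcol = c(2, 2))
for (i in 1:2) plot(1, main = paste("mfcol plot", i))
pos_col <- par("mfg")[1:2]   # (row, column) of the second plot

pos_row
pos_col

# back to a single plot per page
par(mfrow = c(1, 1))
```

With mfrow the second plot lands to the right of the first; with mfcol it lands below it.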

|============================================ | 58%
| So to reiterate, first call a basic plotting routine. For instance, plot makes a
| scatterplot or other type of plot depending on the class of the object being
| plotted.

...

|============================================== | 59%
| As we've seen, R provides several annotating functions. Which of the following is
| NOT one of them?

1: title
2: hist
3: lines
4: text
5: points

Selection: 2

|=============================================== | 61%
| So you can add text, title, points, and lines to an existing plot. To add lines, you
| give a vector of x values and a corresponding vector of y values (or a 2-column
| matrix); the function lines just connects the dots. The function text adds text
| labels to a plot using specified x, y coordinates.

...
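
A short sketch of lines and text on the airquality data. Plotting monthly means is a choice made for this example, not something the lesson asks for.

```r
# scatter of ozone against month number
with(airquality, plot(Month, Ozone))

# monthly means; the formula interface drops rows with missing Ozone
avg <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)

# lines() joins the points in the order given
lines(avg$Month, avg$Ozone, col = "blue", lwd = 2)

# text() places a label at x, y; pos = 2 puts it to the left of the point
top <- airquality[which.max(airquality$Ozone), ]
text(top$Month, top$Ozone, labels = "highest reading", pos = 2)
```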

|================================================ | 62%
| The function title adds annotations. These include x- and y- axis labels, title,
| subtitle, and outer margin. Two other annotating functions are mtext which adds
| arbitrary text to either the outer or inner margins of the plot and axis which adds
| axis ticks and labels. Another useful function is legend which explains to the reader
| what the symbols in your plot mean.

...

|================================================= | 64%
| Before we close, let's test your ability to make a somewhat complicated scatterplot.
| First run plot with 3 arguments: airquality$Wind, airquality$Ozone, and type set
| equal to "n". This tells R to set up the plot but not to put the data in it.

plot(airquality$Wind,airquality$Ozone,type="n")
There were 12 warnings (use warnings() to see them)

| You got it!

|================================================== | 65%
| Now for the test. (You might need to check R's documentation for some of these.) Add
| a title with the argument main set equal to the string "Wind and Ozone in NYC"

title(main = "Wind and Ozone in NYC")

| That's the answer I was looking for.

|=================================================== | 67%
| Now create a variable called may by subsetting airquality appropriately. (Recall
| that the data specifies months by number and May is the fifth month of the year.)

may<-subset(airquality,Month==5)

| All that practice is paying off!

|===================================================== | 68%
| Now use the R command points to plot May's wind and ozone (in that order) as solid
| blue triangles. You have to set the color and plot character with two separate
| arguments. Note we use points because we're adding to an existing plot.

points(may$Wind,may$Ozone,col="blue",pch=17)

| That's a job well done!

|====================================================== | 70%
| Now create the variable notmay by subsetting airquality appropriately.

notmay<-subset(airquality,Month!=5)

| Keep up the great work!

|======================================================= | 71%
| Now use the R command points to plot notmay's wind and ozone (in that order)
| as red snowflakes.

points(notmay$Wind,notmay$Ozone,col="red",pch=8)

|======================================================== | 73%
| Now we'll use the R command legend to clarify the plot and explain what it means.
| The function has a lot of arguments, but we'll only use 4. The first will be the
| string "topright" to tell R where to put the legend. The remaining 3 arguments will
| each be 2-long vectors created by R's concatenate function, e.g., c(). These
| arguments are pch, col, and legend. The first is the vector (17,8), the second
| ("blue","red"), and the third ("May","Other Months"). Try it now.

legend("topright",pch=c(17,8),col=c("blue","red"),legend=c("May","Other Months"))

| That's a job well done!

|========================================================= | 74%
| Now add a vertical line at the median of airquality$Wind. Make it dashed (lty=2)
| with a width of 2.

abline(v=median(airquality$Wind),lty=2,lwd=2)

| You are really on a roll!
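
For reference, the steps of this exercise collected into one script. The commands are the ones entered above; only the spacing and comments are added.

```r
# empty frame sized to the data, then the title
plot(airquality$Wind, airquality$Ozone, type = "n")
title(main = "Wind and Ozone in NYC")

# split the data into May and the other months
may    <- subset(airquality, Month == 5)
notmay <- subset(airquality, Month != 5)

# add each group with its own color and symbol
points(may$Wind, may$Ozone, col = "blue", pch = 17)
points(notmay$Wind, notmay$Ozone, col = "red", pch = 8)

# explain the symbols and mark the median wind speed
legend("topright", pch = c(17, 8), col = c("blue", "red"),
       legend = c("May", "Other Months"))
abline(v = median(airquality$Wind), lty = 2, lwd = 2)
```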

|========================================================== | 76%
| Use par with the parameter mfrow set equal to the vector (1,2) to set up the plot
| window for two plots side by side. You won't see a result.

par(mfrow=c(1,2))

| You are doing so well!

|============================================================ | 77%
| Now plot airquality$Wind and airquality$Ozone and use main to specify the title
| "Ozone and Wind".

plot(airquality$Wind,airquality$Ozone,main="Ozone and Wind")

|============================================================= | 79%
| Now for the second plot.

...

|============================================================== | 80%
| Plot airquality$Ozone and airquality$Solar.R and use main to specify the title
| "Ozone and Solar Radiation".

plot(airquality$Ozone,airquality$Solar.R,main="Ozone and Solar Radiation")

| That's correct!

|=============================================================== | 82%
| Now for something more challenging.

...

|================================================================ | 83%
| This one with 3 plots, to illustrate inner and outer margins. First, set up the plot
| window by typing par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

|================================================================= | 85%
| Margins are specified as 4-long vectors of numbers. Each number tells how many
| lines of text to leave at each side. The numbers are assigned clockwise starting at
| the bottom. The default for the inner margin is c(5.1, 4.1, 4.1, 2.1) so you can see
| we reduced each of these so we'll have room for some outer text.

...
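
A small sketch for inspecting the margin settings. The side labels are added here for readability; par itself returns unnamed vectors.

```r
# the four entries are bottom, left, top, right, measured in lines of text
old <- par(mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

# naming the entries makes the ordering easier to read
sides <- c("bottom", "left", "top", "right")
mar_named <- setNames(par("mar"), sides)
oma_named <- setNames(par("oma"), sides)
mar_named
oma_named

# mai and omi report the same margins in inches
par("mai")

# restore the previous margins
par(old)
```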

|================================================================== | 86%
| The first plot should be familiar. Plot airquality$Wind and airquality$Ozone with
| the title (argument main) as "Ozone and Wind".

plot(airquality$Wind,airquality$Ozone,main="Ozone and Wind")

| You nailed it! Good job!

|==================================================================== | 88%
| The second plot is similar.

...

|===================================================================== | 89%
| Plot airquality$Solar.R and airquality$Ozone with the title (argument main) as
| "Ozone and Solar Radiation".

plot(airquality$Solar.R,airquality$Ozone,main="Ozone and Solar Radiation")

| That's a job well done!

|====================================================================== | 91%
| Now for the final panel.

...

|======================================================================= | 92%
| Plot airquality$Temp and airquality$Ozone with the title (argument main) as "Ozone
| and Temperature".

plot(airquality$Temp,airquality$Ozone,main="Ozone and Temperature")

| You got it!

|======================================================================== | 94%
| Now we'll put in a title.

...

|========================================================================== | 95%
| Since this is the main title, we specify it with the R command mtext. Call mtext
| with the string "Ozone and Weather in New York City" and the argument outer set
| equal to TRUE.

mtext("Ozone and Weather in New York City",outer=TRUE)

| That's correct!
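
The whole three-panel figure as one script, for reference. The plotting commands are the ones entered above; wrapping them in with() and restoring par at the end are additions made for this sketch.

```r
# one row of three panels, tighter inner margins, two lines of outer margin on top
old <- par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

with(airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")
  plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
  plot(Temp, Ozone, main = "Ozone and Temperature")
})

# outer = TRUE writes into the outer margin, above all three panels
mtext("Ozone and Weather in New York City", outer = TRUE)

# restore the single-panel layout and default margins
par(old)
```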

|=========================================================================== | 97%
| Voila! Beautiful, right?

...

|============================================================================ | 98%
| Congrats! You've weathered this lesson nicely and passed out of the No!zone.

...

|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: No
2: Yes

Selection: 2
What is your assignment token? xXxXxxXXxXxxXXXx

| All that practice is paying off!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

Last updated 2020-05-04 20:05:26.618647 IST

# Plotting Systems

setwd("C:/Users/kk/PortableApps/Git/home/k-allika/repos/DataScienceWithR/04_Exploratory_Data_Analysis/week01/workspace")
library(swirl)

| Hi! Type swirl() when you are ready to begin.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you did
| then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Would you like to continue with one of these lessons?

1: Exploratory Data Analysis Plotting Systems
2: No. Let me start something new.

Selection: 1

| Attempting to load lesson dependencies...

| Plotting_Systems. (Slides for this and other Data Science courses may be found at github
| https://github.com/DataScienceSpecialization/courses/. If you care to use them, they must
| be downloaded as a zip file and viewed locally. This lesson corresponds to
| 04_ExploratoryAnalysis/PlottingSystems.)

...

|== | 3%
| In this lesson, we'll give you a brief overview of the three plotting systems in R, their
| differences, strengths, and weaknesses. We'll only cover the basics here to give you a
| general idea of the systems and in later lessons we'll cover each system in more depth.

...

|==== | 5%
| The first plotting system is the Base Plotting System which comes with R. It's the oldest
| system which uses a simple "Artist's palette" model. What this means is that you start
| with a blank canvas and build your plot up from there, step by step.

...

|======= | 8%
| Usually you start with a plot function (or something similar), then you use annotation
| functions to add to or modify your plot. R provides many annotating functions such as
| text, lines, points, and axis. R provides documentation for each of these. They all add to
| an existing plot.

...

|========= | 11%
| What do you think is a disadvantage of the Base Plotting System?

1: It mirrors how we think of building plots and analyzing data
2: A complicated plot is a series of simple R commands
3: You can't go back once a plot has started
4: It's intuitive and exploratory

Selection: 3

| Nice work!

|=========== | 14%
| Yes! The base system is very intuitive and easy to use when you're starting to do
| exploratory graphing and looking for a research direction. You can't go backwards, though,
| say, if you need to readjust margins or fix a misspelled caption. A finished plot will
| be a series of R commands, so it's difficult to translate a finished plot into a different
| system.

...

|============= | 16%
| We've loaded the dataset cars for you to demonstrate how easy it is to plot. First, use
| the R command head with cars as an argument to see what the data looks like.

head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

| You are really on a roll!

|================ | 19%
| So the dataset collates the speeds and distances needed to stop for 50 cars. This data was
| recorded in the 1920's.

...

|================== | 22%
| We'll use the R command with which takes two arguments. The first specifies a dataset or
| environment in which to run the second argument, an R expression. This will save us a bit
| of typing. Try running the command with now using cars as the first argument and a call to
| plot as the second. The call to plot will take two arguments, speed and dist. Please
| specify them in that order.

with(cars,plot(speed,dist))

| You got it right!

|==================== | 24%
| Simple, right? You can see the relationship between the two variables, speed and distance.
| The first variable is plotted along the x-axis and the second along the y-axis.

...
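
A short sketch of what with buys you. Both plotting calls draw the same points; only the default axis labels differ, because they are taken from the expressions as typed.

```r
# these two calls plot the same data
plot(cars$speed, cars$dist)
with(cars, plot(speed, dist))

# with() evaluates any expression inside the data frame, not just plotting calls
m1 <- with(cars, mean(dist / speed))
m2 <- mean(cars$dist / cars$speed)
identical(m1, m2)
```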

|====================== | 27%
| Now we'll show you what the function text does. Run the command text with three arguments.
| The first two, x and y coordinates, specify the placement of the third argument, the text
| to be added to the plot. Let the first argument be mean(cars$speed), the second
| max(cars$dist), and the third the string "SWIRL rules!". Try it now.

text(mean(cars$speed),max(cars$dist),"SWIRL rules!")

| You are quite good my friend!

|========================= | 30%
| Ain't it the truth?

...

|=========================== | 32%
| Now we'll move on to the second plotting system, the Lattice System which comes in the
| package of the same name. Unlike the Base System, lattice plots are created with a single
| function call such as xyplot or bwplot. Margins and spacing are set automatically because
| the entire plot is specified at once.

...

|============================= | 35%
| The lattice system is most useful for conditioning types of plots which display how y
| changes with x across levels of z. The variable z might be a categorical variable of your
| data. This system is also good for putting many plots on a screen at once.

...

|=============================== | 38%
| The lattice system has several disadvantages. First, it is sometimes awkward to specify an
| entire plot in a single function call. Annotating a plot may not be especially intuitive.
| Second, using panel functions and subscripts is somewhat difficult and requires
| preparation. Finally, you cannot "add" to the plot once it is created as you can with the
| base system.

...
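
Since nothing can be added afterwards, extra elements have to go into the single call through a panel function. A minimal sketch on the cars data (the dashed median line is an illustrative choice):

```r
library(lattice)

# everything that should appear in the panel goes into the panel function
p <- xyplot(dist ~ speed, data = cars,
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)                # the points themselves
              panel.abline(h = median(y), lty = 2)   # dashed line at the median distance
            })

print(p)
```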

|================================== | 41%

| As before, we've loaded some data for you in the variable state. This data comes with the
| lattice package and it concerns various characteristics of the 50 states in the U.S. Use
| the R command head to see the first few entries of state now.

head(state)
           Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area region
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708  South
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432   West
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417   West
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945  South
California      21198   5114        1.1    71.71   10.3    62.6    20 156361   West
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766   West

| You are really on a roll!

|==================================== | 43%
| As you can see state holds 9 pieces of information for each of the 50 states. The last
| variable, region, specifies a category for each state. Run the R command table with the
| argument state$region to see how many categories there are and how many states are in
| each.

table(state$region)

    Northeast         South North Central          West
            9            16            12            13

|====================================== | 46%
| So there are 4 categories and the 50 states are sorted into them appropriately. Let's use
| the lattice command xyplot to see how life expectancy varies with income in each of the
| four regions.

...

|======================================== | 49%
| To do this we'll give xyplot 3 arguments. The first is the most complicated. It is this R
| formula, Life.Exp ~ Income | region, which indicates we're plotting life expectancy as it
| depends on income for each region. The second argument, data, is set equal to state. This
| allows us to use "Life.Exp" and "Income" in the formula instead of specifying the dataset
| state for each term (as in state$Income). The third argument, layout, is set equal to the
| two-long vector c(4,1). Run xyplot now with these three arguments.

xyplot(Life.Exp~Income|region,data=state,layout=c(4,1))

| Perseverance, that's the answer.

|=========================================== | 51%
| We see the data for each of the 4 regions plotted in one row. Based on this plot, which
| region of the U.S. seems to have the shortest life expectancy?

1: West
2: South
3: Northeast
4: North Central

Selection: 2

| You got it!

|============================================= | 54%
| Just for fun rerun the xyplot and this time set layout to the vector c(2,2). To save
| typing use the up arrow to recover the previous xyplot command.

xyplot(Life.Exp~Income|region,data=state,layout=c(2,2))

| Your dedication is inspiring!

|=============================================== | 57%
| See how the plot changed? No need for you to worry about margins or labels. The package
| took care of all that for you.

...

|================================================= | 59%
| Now for the last plotting system, ggplot2, which is a hybrid of the base and lattice
| systems. It automatically deals with spacing, text, titles (as Lattice does) but also
| allows you to annotate by "adding" to a plot (as Base does), so it's the best of both
| worlds.

...

|==================================================== | 62%
| Although ggplot2 bears a superficial similarity to lattice, it's generally easier and more
| intuitive to use. Its default mode makes many choices for you but you can still customize
| a lot. The package is based on a "grammar of graphics" (hence the gg in the name), so you
| can control the aesthetics of your plots. For instance, you can plot conditioning graphs
| and panel plots as we did in the lattice example.

...

|====================================================== | 65%
| We'll see an example now of ggplot2 with a simple (single) command. As before, we've
| loaded a dataset for you from the ggplot2 package. This mpg data holds fuel economy data
| between 1999 and 2008 for 38 different models of cars. Run head with mpg as an argument so
| you get an idea of what the data looks like.

head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compact
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compact
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compact
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compact
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compact
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compact

| That's a job well done!

|======================================================== | 68%
| Looks complicated. Run dim with the argument mpg to see how big the dataset is.

dim(mpg)
[1] 234 11

| Excellent work!

|========================================================== | 70%
| Holy cow! That's a lot of information for just 38 models of cars. Run the R command table
| with the argument mpg$model. This will tell us how many models of cars we're dealing with.

table(mpg\$model)

           4runner 4wd                     a4             a4 quattro             a6 quattro
                     6                      7                      8                      3
                altima     c1500 suburban 2wd                  camry           camry solara
                     6                      5                      7                      7
           caravan 2wd                  civic                corolla               corvette
                    11                      9                      5                      5
     dakota pickup 4wd            durango 4wd         expedition 2wd           explorer 4wd
                     9                      7                      3                      6
       f150 pickup 4wd           forester awd     grand cherokee 4wd             grand prix
                     7                      6                      8                      5
                   gti            impreza awd                  jetta        k1500 tahoe 4wd
                     5                      8                      9                      4
land cruiser wagon 4wd                 malibu                 maxima        mountaineer 4wd
                     2                      5                      3                      4
               mustang          navigator 2wd             new beetle                 passat
                     9                      3                      6                      7
        pathfinder 4wd    ram 1500 pickup 4wd            range rover                 sonata
                     4                     10                      4                      7
               tiburon      toyota tacoma 4wd
                     7                      7

| Nice work!

|============================================================= | 73%
| Oh, there are 38 models. We're interested in the effect engine displacement (displ) has on
| highway gas mileage (hwy), so we'll use the ggplot2 command qplot to display this
| relationship. Run qplot now with three arguments. The first two are the variables displ
| and hwy we want to plot, and the third is the argument data set equal to mpg. As before,
| this allows us to avoid using the mpg\$variable notation for the first two arguments.

qplot(displ,hwy,data=mpg)

| You are doing so well!
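
The qplot call above is a shortcut. As a minimal sketch, assuming the ggplot2 package is installed, the same scatterplot can be written with ggplot() and an explicit geom. qplot() has been deprecated since ggplot2 3.4.0, so the longer form may be the safer habit. The variable name p is ours, not the lesson's.

```r
library(ggplot2)  # provides the mpg dataset used in the lesson

# Same plot as qplot(displ, hwy, data = mpg): map displ to x and hwy to y,
# then draw one point per row of mpg.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

print(p)  # printing the object is what actually draws the plot
```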

|=============================================================== | 76%
| Not surprisingly we see that the bigger the engine displacement the lower the gas mileage.

...
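
Two things the transcript reads off by eye can also be checked numerically: the count of car models and the direction of the displacement/mileage trend. The sketch below is not part of the swirl lesson; the names n_models, r and slope are ours, and a straight-line fit is only a rough summary of the scatterplot.

```r
library(ggplot2)  # only needed here for the mpg dataset

# Number of distinct car models: the same 38 that the table() output lists.
n_models <- length(unique(mpg$model))

# Direction of the displ/hwy relationship seen in the scatterplot.
r <- cor(mpg$displ, mpg$hwy)                            # Pearson correlation
slope <- coef(lm(hwy ~ displ, data = mpg))[["displ"]]   # change in hwy per extra litre

n_models
r < 0      # TRUE when larger engines go with lower highway mileage
slope < 0  # TRUE when the fitted line slopes downward
```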

|================================================================= | 78%
| Let's review!

...

|=================================================================== | 81%
| Which R plotting system is based on an artist's palette?

1: Winsor&Newton
2: ggplot2
3: base
4: lattice

Selection: 3

| All that practice is paying off!

|====================================================================== | 84%
| Which R plotting system does NOT allow you to annotate plots with separate calls?

1: base
2: ggplot2
3: Winsor&Newton
4: lattice

Selection: 4

| You got it right!
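
For contrast with lattice, here is a small sketch of what "annotating with separate calls" looks like in the base system. It assumes that state can be rebuilt from R's built-in state.x77 and state.region objects; the lesson's own copy of state may have been prepared differently, and the title text is ours.

```r
# Assumption: rebuild `state` from built-in datasets as a stand-in for the
# data frame the lesson preloaded.
state <- data.frame(state.x77, region = state.region)

# Call 1: draw the basic scatterplot.
plot(state$Income, state$Life.Exp,
     xlab = "Income", ylab = "Life.Exp")

# Later calls annotate the plot that is already on the device.
fit <- lm(Life.Exp ~ Income, data = state)
abline(fit, lty = 2)                         # add a fitted line
title(main = "Life expectancy vs. income")   # add a title
```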

|======================================================================== | 86%
| Which R plotting system combines the best features of the other two?

1: base
2: Winsor&Newton
3: lattice
4: ggplot2

Selection: 4

| You are doing so well!

|========================================================================== | 89%
| Which R plotting system uses a graphics grammar?

1: lattice
2: base
3: Winsor&Newton
4: ggplot2

Selection: 4

| You're the best!
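
To make "graphics grammar" a little more concrete, here is a hedged sketch in which each grammatical piece (data, aesthetic mapping, geoms, facets, labels) is added to the plot with +. It assumes ggplot2 is installed; faceting by drv and the axis label text are our choices for illustration, not something the lesson does.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +       # data and aesthetic mappings
  geom_point() +                                  # layer 1: the raw points
  geom_smooth(method = "lm", formula = y ~ x) +   # layer 2: a linear trend line
  facet_wrap(~ drv) +                             # one panel per drive type
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

print(p)
```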

|============================================================================ | 92%
| Which R plotting system forces you to make your entire plot with one call?

1: Winsor&Newton
2: lattice
3: ggplot2
4: base

Selection: 2

| You are amazing!
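
The xyplot commands earlier in the lesson show the "one call" style the question refers to. As a sketch under the same assumption about how state was built (rebuilt here from state.x77 and state.region), everything we want on the plot, including a per-panel regression line, is passed as an argument to a single xyplot() call. The title text is ours.

```r
library(lattice)

# Assumption: `state` rebuilt from built-in datasets as a stand-in for the
# lesson's preloaded copy.
state <- data.frame(state.x77, region = state.region)

# Points, a per-panel regression line, axis labels, title and panel layout
# are all given as arguments to the one xyplot() call.
p <- xyplot(Life.Exp ~ Income | region, data = state,
            layout = c(2, 2),
            type = c("p", "r"),
            xlab = "Income", ylab = "Life.Exp",
            main = "Life expectancy vs. income by region")

print(p)
```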

|=============================================================================== | 95%
| Which of the following sells high quality artists' brushes?

1: Winsor&Newton
2: base
3: ggplot2
4: lattice

Selection: 1

| You nailed it! Good job!

|================================================================================= | 97%
| Congrats! You've concluded this plotting lesson. We hope you didn't find it plodding.

...

|===================================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 1
What is your assignment token? xXxXxxXXxXxxXXXx

| You nailed it! Good job!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

Last updated 2020-05-04 19:59:20.871482 IST