# Exploratory Graphs

library(swirl)

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you did

| then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis

2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

1: Principles of Analytic Graphs

2: Exploratory Graphs

3: Graphics Devices in R

4: Plotting Systems

5: Base Plotting System

6: Lattice Plotting System

7: Working with Colors

8: GGPlot2 Part1

9: GGPlot2 Part2

10: GGPlot2 Extras

11: Hierarchical Clustering

12: K Means Clustering

13: Dimension Reduction

14: Clustering Example

15: CaseStudy

Selection: 2

| | 0%

| Exploratory_Graphs. (Slides for this and other Data Science courses may be found at github

| https://github.com/DataScienceSpecialization/courses/. If you care to use them, they must

| be downloaded as a zip file and viewed locally. This lesson corresponds to

| 04_ExploratoryAnalysis/exploratoryGraphs.)

Error in (function (srcref) : unimplemented type (29) in 'eval'

Error: no more error handlers available (recursive errors?); invoking 'abort' restart

Error in readline("...") :

INTEGER() can only be applied to a 'integer', not a 'unknown type #29'

In addition: Warning message:

In readline("...") : type 29 is unimplemented in 'type2char'

| Leaving swirl now. Type swirl() to resume.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you did

| then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Would you like to continue with one of these lessons?

1: Exploratory Data Analysis Exploratory Graphs

2: No. Let me start something new.

Selection: 1

| Exploratory_Graphs. (Slides for this and other Data Science courses may be found at github

| https://github.com/DataScienceSpecialization/courses/. If you care to use them, they must

| be downloaded as a zip file and viewed locally. This lesson corresponds to

| 04_ExploratoryAnalysis/exploratoryGraphs.)

...

|= | 1%

| In this lesson, we'll discuss why graphics are an important tool for data scientists and

| the special role that exploratory graphs play in the field.

...

|== | 3%

| Which of the following would NOT be a good reason to use graphics in data science?

1: To understand data properties

2: To find a color that best matches the shirt you're wearing

3: To find patterns in data

4: To suggest modeling strategies

Selection: 2

| All that practice is paying off!

|=== | 4%

| So graphics give us some visual form of data, and since our brains are very good at seeing

| patterns, graphs give us a compact way to present data and find or display any pattern

| that may be present.

...

|==== | 5%

| Which of the following cliches captures the essence of graphics?

1: To err is human, to forgive divine

2: A rose by any other name smells as sweet

3: A picture is worth a 1000 words

4: The apple doesn't fall far from the tree

Selection: 3

| Excellent work!

|====== | 7%

| Exploratory graphs serve mostly the same functions as graphs. They help us find patterns

| in data and understand its properties. They suggest modeling strategies and help to debug

| analyses. We DON'T use exploratory graphs to communicate results.

...

|======= | 8%

| Instead, exploratory graphs are the initial step in an investigation, the "quick and

| dirty" tool used to point the data scientist in a fruitful direction. A scientist might

| need to make a lot of exploratory graphs in order to develop a personal understanding of

| the problem being studied. Plot details such as axes, legends, color and size are cleaned

| up later to convey more information in an aesthetically pleasing way.

...

|======== | 9%

| To demonstrate these ideas, we've copied some data for you from the U.S. Environmental

| Protection Agency (EPA) which sets national ambient air quality standards for outdoor air

| pollution. These Standards say that for fine particle pollution (PM2.5), the "annual mean,

| averaged over 3 years" cannot exceed 12 micrograms per cubic meter. We stored the data

| from the U.S. EPA web site in the data frame pollution. Use the R function head to see the

| first few entries of pollution.

head(pollution)

pm25 fips region longitude latitude

1 9.771185 01003 east -87.74826 30.59278

2 9.993817 01027 east -85.84286 33.26581

3 10.688618 01033 east -87.72596 34.73148

4 11.337424 01049 east -85.79892 34.45913

5 12.119764 01055 east -86.03212 34.01860

6 10.827805 01069 east -85.35039 31.18973

| You nailed it! Good job!

|========= | 11%

| We see right away that there's at least one county exceeding the EPA's standard of 12

| micrograms per cubic meter. What else do we see?

...

|========== | 12%

| We see 5 columns of data. The pollution count is in the first column labeled pm25. We'll

| work mostly with that. The other 4 columns are a fips code indicating the state (first 2

| digits) and county (last 3 digits) with that count, the associated region (east or west),

| and the longitude and latitude of the area. Now run the R command dim with pollution as an

| argument to see how long the table is.

dim(pollution)

[1] 576 5

| Great job!

|=========== | 13%

| So there are 576 entries in pollution. We'd like to investigate the question "Are there

| any counties in the U.S. that exceed that national standard (12 micrograms per cubic

| meter) for fine particle pollution?" We'll look at several one dimensional summaries of

| the data to investigate this question.

...

|============ | 15%

| The first technique uses the R command summary, a 5-number summary which returns 6

| numbers. Run it now with the pm25 column of pollution as its argument. Recall that the

| construct for this is pollution$pm25.

summary(pollution$pm25)

Min. 1st Qu. Median Mean 3rd Qu. Max.

3.383 8.549 10.047 9.836 11.356 18.441

| You are doing so well!

|============= | 16%

| This shows us basic info about the pm25 data, namely its Minimum (0 percentile) and

| Maximum (100 percentile) values, and three Quartiles of the data. These last indicate the

| pollution measures at which 25%, 50%, and 75% of the counties fall below. In addition to

| these 5 numbers we see the Mean or average measure of particulate pollution across the 576

| counties.
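
The link between the six numbers from summary and the five from quantile can be checked directly. This is a minimal sketch on a small made-up vector `x` (a hypothetical stand-in, not the EPA data); it only illustrates that summary is the five quantiles plus the mean.

```r
# Made-up values standing in for pollution$pm25; NOT the EPA data.
x <- c(3.4, 8.5, 9.1, 10.0, 10.4, 11.4, 12.1, 18.4)

s  <- summary(x)    # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
sv <- unclass(s)    # plain named numeric vector, easier to index
q  <- quantile(x)   # 0%, 25%, 50%, 75%, 100%

# Dropping the mean from summary() leaves the five quantiles.
five <- unname(sv[c("Min.", "1st Qu.", "Median", "3rd Qu.", "Max.")])
all.equal(five, unname(q))

# The sixth number is the arithmetic mean.
all.equal(unname(sv["Mean"]), mean(x))
```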

...

|============== | 17%

| Half the measured counties have a pollution level less than or equal to what number of

| micrograms per cubic meter?

1: 10.050

2: 9.836

3: 8.549

4: 11.360

Selection: 1

| You're the best!

|=============== | 19%

| To save you a lot of typing we've saved off pollution$pm25 for you in the variable ppm.

| You can use ppm now in place of the longer expression. Try it now as the argument of the R

| command quantile. See how the results look a lot like the results of the output of the

| summary command.

quantile(ppm)

0% 25% 50% 75% 100%

3.382626 8.548799 10.046697 11.356012 18.440731

| All that hard work is paying off!

|================= | 20%

| See how the results are similar to those returned by summary? Quantile gives the

| quartiles, right? What is the one value missing from this quantile output that summary

| gave you?

1: the maximum value

2: the median

3: the minimum value

4: the mean

Selection: 4

| Excellent work!

|================== | 21%

| Now we'll plot a picture, specifically a boxplot. Run the R command boxplot with ppm as an

| input. Also specify the color parameter col equal to "blue".

boxplot(ppm,col="blue")

| That's a job well done!

|=================== | 23%

| The boxplot shows us the same quartile data that summary and quantile did. The lower and

| upper edges of the blue box respectively show the values of the 25% and 75% quantiles.

...

|==================== | 24%

| What do you think the horizontal line inside the box represents?

1: the maximum value

2: the mean

3: the minimum value

4: the median

Selection: 4

| Nice work!

|===================== | 25%

| The "whiskers" of the box (the vertical lines extending above and below the box) relate to

| the range parameter of boxplot, which we let default to the value 1.5 used by R. The

| height of the box is the interquartile range, the difference between the 75th and 25th

| quantiles. In this case that difference is 2.8. The whiskers are drawn to be a length of

| range*2.8 or 1.5*2.8. This shows us roughly how many, if any, data points are outliers,

| that is, beyond this range of values.
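
The whisker arithmetic can be reproduced from numbers already printed above. In R, each whisker ends at the most extreme data point that lies within range × IQR of the box edge, so 1.5 × 2.8 is the maximum whisker length rather than a fixed one. The sketch below uses only values copied from the quantile output; boxplot computes its box edges as hinges (via fivenum), which can differ slightly from these quantiles, so treat the limits as approximate.

```r
# Values copied from the quantile(ppm) output earlier in the lesson.
q_min <- 3.382626
q1    <- 8.548799
q3    <- 11.356012
q_max <- 18.440731

iqr <- q3 - q1                  # about 2.8, the height of the box

# Limits beyond which points are drawn individually as outliers.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr
c(lower = lower_limit, upper = upper_limit)

# Both extremes fall outside the limits, so the plot shows
# outlier points at the bottom and at the top.
q_min < lower_limit
q_max > upper_limit
```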

...

|====================== | 27%

| Note that boxplot is part of R's base plotting package. A nice feature that this package

| provides is its ability to overlay features. That is, you can add to (annotate) an

| existing plot.

...

|======================= | 28%

| To see this, run the R command abline with the argument h equal to 12. Recall that 12 is

| the EPA standard for air quality.

abline(h=12)

| That's a job well done!

|======================== | 29%

| What do you think this command did?

1: drew a horizontal line at 12

2: hid 12 random data points

3: drew a vertical line at 12

4: nothing

Selection: 1

| Keep up the great work!

|========================= | 31%

| So abline "adds one or more straight lines through the current plot." We see from the plot

| that the bulk of the measured counties comply with the standard since they fall under the

| line marking that standard.

...

|=========================== | 32%

| Now use the R command hist (another function from the base package) with the argument ppm.

| Specify the color parameter col equal to "green". This will plot a histogram of the data.

hist(ppm,col="green")

| You nailed it! Good job!

|============================ | 33%

| The histogram gives us a little more detailed information about our data, specifically the

| distribution of the pollution counts, or how many counties fall into each bucket of

| measurements.

...

|============================= | 35%

| What are the most frequent pollution counts?

1: between 9 and 12

2: between 12 and 14

3: between 6 and 8

4: under 5

Selection: 1

| You're the best!

|============================== | 36%

| Now run the R command rug with the argument ppm.

rug(ppm)

| That's correct!

|=============================== | 37%

| This one-dimensional plot, with its grayscale representation, gives you a little more

| detailed information about how many data points are in each bucket and where they lie

| within the bucket. It shows (through density of tick marks) that the greatest

| concentration of counties has between 9 and 12 micrograms per cubic meter just as the

| histogram did.
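
The bucket counts that the histogram draws can also be read off as numbers. A minimal sketch, assuming simulated data in a hypothetical vector `x` (not the EPA measurements): hist returns the bin information invisibly, and rug adds one tick per observation to the plot that is already on screen.

```r
set.seed(1)                            # made-up data, seeded for reproducibility
x <- rnorm(200, mean = 10, sd = 2.5)   # hypothetical stand-in for ppm

h <- hist(x, col = "green")   # draws the plot and returns the bin details
rug(x)                        # tick marks along the x-axis of that plot

# Edges of the bucket holding the most observations.
busiest <- which.max(h$counts)
h$breaks[c(busiest, busiest + 1)]
```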

...

|================================ | 39%

| To illustrate this a little more, we've defined for you two vectors, high and low,

| containing pollution data of high (greater than 15) and low (less than 5) values

| respectively. Look at low now and see how it relates to the output of rug.

low

[1] 3.494351 4.186090 4.917140 4.504539 4.793644 4.601408 4.195688 4.625279 4.460193

[10] 4.978397 4.324736 4.175901 3.382626 4.132739 4.955570 4.565808

| All that hard work is paying off!

|================================= | 40%

| It confirms that there are two data points between 3 and 4 and many between 4 and 5. Now

| look at high.

high

[1] 16.19452 15.80378 18.44073 16.66180 15.01573 17.42905 16.25190 16.18358

| Excellent job!

|================================== | 41%

| Again, we see one data point greater than 18, one between 17 and 18, several between 16

| and 17 and two between 15 and 16, verifying what rug indicated.

...

|=================================== | 43%

| Now rerun hist with 3 arguments, ppm as its first, col equal to "green", and the argument

| breaks equal to 100.

hist(ppm,col="green",breaks=100)

| All that practice is paying off!

|===================================== | 44%

| What do you think the breaks argument specifies in this case?

1: the number of data points to graph

2: the number of counties exceeding the EPA standard

3: the number of buckets to split the data into

4: the number of stars in the sky

Selection: 3

| You are amazing!

|====================================== | 45%

| So this histogram with more buckets is not nearly as smooth as the preceding one. In fact,

| it's a little too noisy to see the distribution clearly. When you're plotting histograms

| you might have to experiment with the argument breaks to get a good idea of your data's

| distribution. For fun now, rerun the R command rug with the argument ppm.

rug(ppm)

| That's a job well done!

|======================================= | 47%

| See how rug works with the existing plot? It automatically adjusted its pocket size to

| that of the last plot plotted.
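
When breaks is a single number, hist treats it as a suggestion and picks "pretty" break points near that count, so the resulting number of buckets is not necessarily exactly 100. A short sketch on simulated data (the vector `x` is made up for illustration) compares the default, a suggested count, and explicit break points, which are used exactly as given.

```r
set.seed(2)                            # made-up data, seeded for reproducibility
x <- rnorm(500, mean = 10, sd = 2.5)   # hypothetical stand-in for ppm

coarse <- hist(x, plot = FALSE)                 # default bucket choice
fine   <- hist(x, breaks = 100, plot = FALSE)   # roughly 100 buckets

length(coarse$counts)
length(fine$counts)

# A vector of break points is honored exactly: buckets of width 0.5.
edges <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
exact <- hist(x, breaks = edges, plot = FALSE)
unique(diff(exact$breaks))
```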

...

|======================================== | 48%

| Now rerun hist with ppm as the data and col equal to "green".

hist(ppm,col="green")

| Great job!

|========================================= | 49%

| Now run the command abline with the argument v equal to 12 and the argument lwd equal to

| 2.

abline(v=12,lwd=2)

| You are doing so well!

|========================================== | 51%

| See the vertical line at 12? Not very visible, is it, even though you specified a line

| width of 2? Run abline with the argument v equal to median(ppm), the argument col equal to

| "magenta", and the argument lwd equal to 4.

abline(v=median(ppm),col="magenta",lwd=4)

| You are quite good my friend!

|=========================================== | 52%

| Better, right? Thicker and more of a contrast in color. This shows that although the

| median (50%) is below the standard, there are a fair number of counties in the U.S. that

| have pollution levels higher than the standard.
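
"A fair number" can be turned into a count once the reference line is drawn. The sketch below is illustrative only: `x` is simulated, so the numbers it produces say nothing about the real EPA data.

```r
set.seed(3)                            # made-up data, seeded for reproducibility
x <- rnorm(300, mean = 10, sd = 2.5)   # hypothetical stand-in for ppm

hist(x, col = "green")
abline(v = 12, lwd = 2)                            # the standard
abline(v = median(x), col = "magenta", lwd = 4)    # the median

above <- sum(x > 12)    # how many values exceed the standard
share <- mean(x > 12)   # the same as a proportion
c(count = above, share = share)
```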

...

|============================================ | 53%

| Now recall that our pollution data had 5 columns of information. So far we've only looked

| at the pm25 column. We can also look at other information. To remind yourself what's there

| run the R command names with pollution as the argument.

names(pollution)

[1] "pm25" "fips" "region" "longitude" "latitude"

| Keep up the great work!

|============================================= | 55%

| Longitude and latitude don't sound interesting, and each fips is unique since it

| identifies states (first 2 digits) and counties (last 3 digits). Let's look at the region

| column to see what's there. Run the R command table on this column. Use the construct

| pollution$region. Store the result in the variable reg.

reg<-table(pollution$region)

| All that practice is paying off!

|============================================== | 56%

| Look at reg now.

reg

east west

442 134

| Nice work!

|================================================ | 57%

| There are a lot more counties in the east than in the west. We'll use the R command barplot (another type of

| one-dimensional summary) to plot this information. Call barplot with reg as its first

| argument, the argument col equal to "wheat", and the argument main equal to the string

| "Number of Counties in Each Region".

barplot(reg,col="wheat",main="Number of Counties in Each Region")

| You are quite good my friend!
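
The same bar chart can be rebuilt from the two counts printed by reg, without the original data frame. A minimal sketch that copies those counts into a named vector:

```r
# Counts copied from the reg output above.
reg <- c(east = 442, west = 134)

barplot(reg, col = "wheat", main = "Number of Counties in Each Region")

sum(reg)                   # total number of entries, matching dim(pollution)
round(reg / sum(reg), 3)   # share of counties in each region
```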

|================================================= | 59%

| What do you think the argument main specifies?

1: the y axis label

2: the title of the graph

3: the x axis label

4: I can't tell

Selection: 2

| You are doing so well!

|================================================== | 60%

| So we've seen several examples of one-dimensional graphs that summarize data. Two

| dimensional graphs include scatterplots, multiple graphs (which we'll see more examples of),

| and overlaid one-dimensional plots, which R packages such as lattice and ggplot2

| provide.

...

|=================================================== | 61%

| Some graphs have more than two dimensions. These include overlaid or multiple

| two-dimensional plots and spinning plots. Some three-dimensional plots are tricky to

| understand so have limited applications. We'll see some examples now of more complicated

| graphs; in particular, we'll show two graphs together.

...

|==================================================== | 63%

| First we'll show how R, in one line and using base plotting, can display multiple

| boxplots. We simply specify that we want to see the pollution data as a function of

| region. We know that our pollution data characterized each of the 576 entries as belonging

| to one of two regions (east and west).

...

|===================================================== | 64%

| We use the R formula y ~ x to show that y (in this case pm25) depends on x (region). Since

| both come from the same data frame (pollution) we can specify a data argument set equal to

| pollution. By doing this, we don't have to type pollution$pm25 (or ppm) and

| pollution$region. We can just specify the formula pm25~region. Call boxplot now with this

| formula as its argument, data equal to pollution, and col equal to "red".

boxplot(pm25~region,data=pollution,col="red")

| Perseverance, that's the answer.
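
The formula interface draws one box per level of the grouping variable. The sketch below assumes a tiny hypothetical data frame `toy` with the same two column names as pollution (the values are made up) and checks that the midline of each box is the group median.

```r
# Hypothetical miniature stand-in for the pollution data frame.
toy <- data.frame(
  pm25   = c(9.8, 10.7, 12.1, 11.3, 6.2, 15.9, 4.4, 8.0),
  region = factor(c("east", "east", "east", "east",
                    "west", "west", "west", "west"))
)

# pm25 as a function of region: one box per region.
b <- boxplot(pm25 ~ region, data = toy, col = "red")

b$names        # group labels, in factor-level order
b$stats[3, ]   # midline of each box

# The midlines agree with the per-group medians.
group_medians <- tapply(toy$pm25, toy$region, median)
group_medians
```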

|====================================================== | 65%

| Two for the price of one! Similarly we can plot multiple histograms in one plot, though to

| do this we have to use more than one R command. First we have to set up the plot window

| with the R command par which specifies how we want to lay out the plots, say one above the

| other. We also use par to specify margins, a 4-long vector which indicates the number of

| lines for the bottom, left, top and right. Type the R command

| par(mfrow=c(2,1),mar=c(4,4,2,1)) now. Don't expect to see any new result.

par(mfrow=c(2,1),mar=c(4,4,2,1))

| That's a job well done!

|======================================================= | 67%

| So we set up the plot window for two rows and one column with the mfrow argument. The mar

| argument set up the margins. Before we plot the histograms let's explore the R command

| subset which, not surprisingly, "returns subsets of vectors, matrices or data frames which

| meet conditions". We'll use subset to pull off the data we want to plot. Call subset now

| with pollution as its first argument and a boolean expression testing region for equality

| with the string "east". Put the result in the variable east.

east<-subset(pollution,region=="east")

| Keep working like that and you'll get there!
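
subset is a convenience wrapper around logical row indexing. A minimal sketch, again on a made-up data frame `toy` (hypothetical values, same column names as pollution), shows the two forms side by side.

```r
# Hypothetical miniature stand-in for the pollution data frame.
toy <- data.frame(
  pm25   = c(9.8, 10.7, 12.1, 11.3, 6.2, 15.9, 4.4, 8.0),
  region = c("east", "east", "east", "east",
             "west", "west", "west", "west")
)

east     <- subset(toy, region == "east")   # column names used directly
east_idx <- toy[toy$region == "east", ]     # the same rows via indexing
all.equal(east, east_idx)

# A single column can be pulled from the result in the same expression.
west_pm25 <- subset(toy, region == "west")$pm25
west_pm25
```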

|======================================================== | 68%

| Use head to look at the first few entries of east.

head(east)

pm25 fips region longitude latitude

1 9.771185 01003 east -87.74826 30.59278

2 9.993817 01027 east -85.84286 33.26581

3 10.688618 01033 east -87.72596 34.73148

4 11.337424 01049 east -85.79892 34.45913

5 12.119764 01055 east -86.03212 34.01860

6 10.827805 01069 east -85.35039 31.18973

| Excellent work!

|========================================================== | 69%

| So east holds more information than we need. We just want to plot a histogram with the

| pm25 portion. Call hist now with the pm25 portion of east as its first argument and col

| equal to "green" as its second.

hist(east$pm25,col="green")

| You got it!

|=========================================================== | 71%

| See? The command par told R we were going to have one column with 2 rows, so it placed

| this histogram in the top position.
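
par returns the previous settings when it changes them, which makes it easy to restore the layout afterwards. The sketch below stacks two histograms on simulated data (`toy` is a hypothetical stand-in). The shared xlim and the explicit titles are additions of this sketch, not part of the lesson, chosen so the two panels are directly comparable.

```r
set.seed(4)   # made-up data, seeded for reproducibility
toy <- data.frame(
  pm25   = c(rnorm(60, mean = 10, sd = 2), rnorm(40, mean = 8, sd = 3)),
  region = rep(c("east", "west"), times = c(60, 40))
)

# Two rows, one column; par() hands back the settings it replaced.
old <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

xr <- range(toy$pm25)   # common x-axis for both panels
hist(subset(toy, region == "east")$pm25, col = "green",
     xlim = xr, main = "East", xlab = "pm25")
hist(subset(toy, region == "west")$pm25, col = "green",
     xlim = xr, main = "West", xlab = "pm25")

par(old)   # back to the previous layout and margins
```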

...

|============================================================ | 72%

| Now, here's a challenge for you. Plot the histogram of the counties from the west using

| just one R command. Let the appropriate subset command (with the pm25 portion specified)

| be the first argument and col (equal to "green") the second. To cut down on your typing,

| use the up arrow key to get your last command and replace "east" with the subset command.

| Make sure the boolean argument checks for equality between region and "west".

hist(subset(pollution,region=="west")$pm25,col="green")

| You are really on a roll!

|============================================================= | 73%

| See how R does all the labeling for you? Notice that the titles are different since we

| used different commands for the two plots. Let's look at some scatter plots now.

...

|============================================================== | 75%

| Scatter plots are two-dimensional plots which show the relationship between two variables,

| usually x and y. Let's look at a scatterplot showing the relationship between latitude and

| the pm25 data. We'll use plot, a function from R's base plotting package.

...

|=============================================================== | 76%

| We've seen that we can use a function call as an argument when calling another function.

| We'll do this again when we call plot with the arguments latitude and pm25 which are both

| from our data frame pollution. We'll call plot from inside the R command with which

| evaluates "an R expression in an environment constructed from data". We'll use pollution

| as the first argument to with and the call to plot as the second. This allows us to avoid

| typing "pollution$" before the arguments to plot, so it saves us some typing and adds to

| your base of R knowledge. Try this now.

with(pollution,plot(latitude,pm25))

| You nailed it! Good job!
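
with is not specific to plotting: any expression given as its second argument is evaluated with the data frame's columns visible by name. A minimal sketch on a made-up data frame `toy` (hypothetical values):

```r
# Hypothetical miniature stand-in for the pollution data frame.
toy <- data.frame(
  latitude = c(30.6, 33.3, 34.7, 40.1, 45.2, 47.9),
  pm25     = c(9.8, 10.0, 10.7, 12.4, 7.9, 6.1)
)

# Columns are referred to by bare name inside with().
with(toy, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)   # dashed reference line on the same plot

# The same shortcut works for ordinary calculations.
north_mean <- with(toy, mean(pm25[latitude > 40]))
north_mean
```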

|================================================================ | 77%

| Note that the first argument is plotted along the x-axis and the second along the y. Now

| use abline to add a horizontal line at 12. Use two additional arguments, lwd equal to 2

| and lty also equal to 2. See what happens.

abline(h=12,lwd=2,lty=2)

| That's correct!

|================================================================= | 79%

| See how lty=2 made the line dashed? Now let's replot the scatterplot. This time, instead

| of using with, call plot directly with 3 arguments. The first 2 are pollution$latitude and

| ppm. The third argument, col, we'll use to add color and more information to our plot. Set

| this argument (col) equal to pollution$region and see what happens.

plot(pollution$latitude,ppm,col=pollution$region)

| Perseverance, that's the answer.

|================================================================== | 80%

| We've got two colors on the map to distinguish between counties in the east and those in

| the west. Can we figure out which color is east and which west? See that the high (greater

| than 50) and low (less than 25) latitudes are both red. Latitudes indicate distance from

| the equator, so which half of the U.S. (east or west) has counties at the extreme north

| and south?

1: east

2: west

Selection: 2

| That's a job well done!
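
The color assignment need not be guessed. When col is a factor, R uses the factor's integer codes as indices into the current palette, and levels are sorted alphabetically by default, so "east" is color 1 and "west" is color 2. A minimal sketch with a made-up factor follows. It assumes region is stored as a factor, as it evidently is in the lesson's data frame; a plain character vector such as "east" would be rejected as an invalid color name. The legend call is an addition of this sketch.

```r
# Made-up values; only the factor mechanics matter here.
region   <- factor(c("east", "west", "west", "east"))
latitude <- c(33.3, 47.9, 25.1, 40.2)
pm25     <- c(10.0, 6.1, 7.4, 12.4)

levels(region)       # alphabetical order decides the codes
as.integer(region)   # codes passed on as color numbers

codes <- as.integer(region)
palette()[codes]     # the actual colors used for each point

plot(latitude, pm25, col = region)
legend("topright", legend = levels(region),
       col = seq_along(levels(region)), pch = 1)
```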

|==================================================================== | 81%

| As before, use abline to add a horizontal line at 12. Use two additional arguments, lwd

| equal to 2 and lty also equal to 2.

abline(h=12,lwd=2,lty=2)

| You are amazing!

|===================================================================== | 83%

| We see many counties are above the healthy standard set by the EPA, but it's hard to tell

| overall, which region, east or west, is worse.

...

|====================================================================== | 84%

| Let's plot two scatterplots distinguished by region.

...

|======================================================================= | 85%

| As we did with multiple histograms, we first have to set up the plot window with the R

| command par. This time, let's plot the scatterplots side by side (one row and two

| columns). We also need to use different margins. Type the R command par(mfrow = c(1, 2),

| mar = c(5, 4, 2, 1)) now. Don't expect to see any new result.

par(mfrow = c(1, 2),mar = c(5, 4, 2, 1))

| You are quite good my friend!

|======================================================================== | 87%

| For the first scatterplot, on the left, we'll plot the latitudes and pm25 counts from the

| west. We already pulled out the information for the counties in the east. Let's now get

| the information for the counties from the west. Create the variable west by using the

| subset command with pollution as the first argument and the appropriate boolean as the

| second.

west<-subset(pollution,region="west")

| Keep trying! Or, type info() for more options.

| Type west <- subset(pollution,region=="west") at the command prompt.

west<-subset(pollution,region=="west")

| That's the answer I was looking for.
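
The first attempt used a single = where == was needed. Outside swirl that mistake is easy to miss, because subset raises no error: region="west" is taken as a named argument rather than a test, no condition is supplied, and every row is kept. A minimal sketch on a made-up data frame `toy` (hypothetical values):

```r
# Hypothetical miniature stand-in for the pollution data frame.
toy <- data.frame(
  pm25   = c(9.8, 10.7, 6.2, 15.9),
  region = c("east", "east", "west", "west")
)

wrong <- subset(toy, region = "west")    # named argument, not a comparison
right <- subset(toy, region == "west")   # logical test on each row

nrow(wrong)   # 4: nothing was filtered
nrow(right)   # 2: only the west rows
```

Checking nrow or table(west$region) after subsetting is a cheap way to catch this.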

|========================================================================= | 88%

| Now call plot with three arguments. These are west$latitude (x-axis), west$pm25 (y-axis),

| and the argument main equal to the string "West" (title). Do this now.

plot(west$latitude,west$pm25,main="West")

| You are really on a roll!

|========================================================================== | 89%

| For the second scatterplot, on the right, we'll plot the latitudes and pm25 counts from

| the east.

...

|=========================================================================== | 91%

| As before, use the up arrow key and change the 3 "West" strings to "East".

plot(east$latitude,east$pm25,main="East")

| You're the best!

|============================================================================ | 92%

| See how R took care of all the details for you? Nice, right? It looks like there are more

| dirty counties in the east but the extreme dirt (greater than 15) is in the west.

...

|============================================================================= | 93%

| Let's summarize and review.

...

|=============================================================================== | 95%

| Which of the following characterizes exploratory plots?

1: quick and dead

2: slow and clean

3: quick and dirty

4: slow and steady

Selection: 3

| That's the answer I was looking for.

|================================================================================ | 96%

| True or false? Plots let you summarize the data (usually graphically) and highlight any

| broad features

1: False

2: True

Selection: 2

| You are amazing!

|================================================================================= | 97%

| Which of the following do plots NOT do?

1: Explore basic questions and hypotheses (and perhaps rule them out)

2: Conclude that you are ALWAYS right

3: Suggest modeling strategies for the "next step"

4: Summarize the data (usually graphically) and highlight any broad features

Selection: 2

| That's a job well done!

|================================================================================== | 99%

| Congrats! You've concluded exploring this lesson on graphics. We hope you didn't find it

| too quick or dirty.

...

|===================================================================================| 100%

| Would you like to receive credit for completing this course on Coursera.org?

1: Yes

2: No

Selection: 2

| Excellent job!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis

2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

*Last updated 2020-05-02 19:17:35.849963 IST*
