Missing Values

Krishnakanth Allika

2020-04-13 23:33

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you
| did then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: R Programming
2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

 1: Basic Building Blocks       2: Workspace and Files    3: Sequences of Numbers
 4: Vectors                     5: Missing Values         6: Subsetting Vectors
 7: Matrices and Data Frames    8: Logic                  9: Functions
10: lapply and sapply          11: vapply and tapply     12: Looking at Data
13: Simulation                 14: Dates and Times       15: Base Graphics

Selection: 5

| | 0%

| Missing values play an important role in statistics and data analysis. Often, missing
| values must not be ignored, but rather they should be carefully studied to see if
| there's an underlying pattern or cause for their missingness.

...

|==== | 5%
| In R, NA is used to represent any value that is 'not available' or 'missing' (in the
| statistical sense). In this lesson, we'll explore missing values further.

...

|======== | 10%
| Any operation involving NA generally yields NA as the result. To illustrate, let's
| create a vector c(44, NA, 5, NA) and assign it to a variable x.

x <- c(44, NA, 5, NA)

| All that practice is paying off!

|============ | 15%
| Now, let's multiply x by 3.

x*3
[1] 132 NA 15 NA

| You are amazing!

|================ | 20%
| Notice that the elements of the resulting vector that correspond with the NA values in
| x are also NA.
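
A side note that is not part of the swirl session: the word "generally" in the prompt above matters. The sketch below shows NA propagating through arithmetic, and then a few cases where R can work out the answer without knowing the missing value. The variable names `tripled` and `total` are made up for illustration.

```r
x <- c(44, NA, 5, NA)

# Arithmetic is element-wise, so each NA in x stays NA in the result.
tripled <- x * 3
print(tripled)

# A summary over the whole vector is NA too, because the total is unknown.
total <- sum(x)
print(total)

# Some operations do not depend on the missing value at all.
print(NA^0)         # 1: anything raised to the power 0 is 1
print(NA || TRUE)   # TRUE: the result is TRUE whatever the NA stands for
print(NA && FALSE)  # FALSE: the result is FALSE whatever the NA stands for
```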

...

|==================== | 25%
| To make things a little more interesting, let's create a vector containing 1000 draws
| from a standard normal distribution with y <- rnorm(1000).

y <- rnorm(1000)

| That's correct!

|======================== | 30%
| Next, let's create a vector containing 1000 NAs with z <- rep(NA, 1000).

z <- rep(NA, 1000)

| You are doing so well!

|============================ | 35%
| Finally, let's select 100 elements at random from these 2000 values (combining y and z)
| such that we don't know how many NAs we'll wind up with or what positions they'll
| occupy in our final vector -- my_data <- sample(c(y, z), 100).

my_data <- sample(c(y, z), 100)

| You nailed it! Good job!
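
Because `rnorm()` and `sample()` are random, the values printed later in this session will not match a re-run. A minimal sketch of one way to make such a draw repeatable, assuming you are free to fix the seed with `set.seed()` (the seed 42 and the names `pool`, `first_draw` and `second_draw` are arbitrary choices, not part of the lesson):

```r
y <- rnorm(1000)     # 1000 draws from a standard normal distribution
z <- rep(NA, 1000)   # 1000 missing values
pool <- c(y, z)      # the combined pool that sample() draws from
print(length(pool))  # 2000

# Resetting the same seed before each call makes sample() pick the same elements.
set.seed(42)
first_draw <- sample(pool, 100)
set.seed(42)
second_draw <- sample(pool, 100)

print(length(first_draw))                  # 100
print(identical(first_draw, second_draw))  # TRUE
```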

|================================ | 40%
| Let's first ask the question of where our NAs are located in our data. The is.na()
| function tells us whether each element of a vector is NA. Call is.na() on my_data and
| assign the result to my_na.

my_na <- is.na(my_data)

| That's correct!

|==================================== | 45%
| Now, print my_na to see what you came up with.

my_na
[1] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
[15] TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[29] TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
[43] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[57] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[71] TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
[85] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
[99] TRUE FALSE

| You are quite good my friend!

|======================================== | 50%
| Everywhere you see a TRUE, you know the corresponding element of my_data is NA.
| Likewise, everywhere you see a FALSE, you know the corresponding element of my_data is
| one of our random draws from the standard normal distribution.
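
Reading 100 TRUE/FALSE values by eye is tedious. As a sketch outside the lesson, `which()` turns the logical vector into the positions of the NAs, and negating it keeps only the observed values. The six-element vector below is a shortened, rounded stand-in for `my_data`, used only to keep the output small.

```r
# Shortened illustration vector, not the real my_data from the session.
my_data <- c(-0.58, NA, -0.11, 0.84, -1.07, NA)

my_na <- is.na(my_data)
print(my_na)           # FALSE TRUE FALSE FALSE FALSE TRUE

# which() returns the indices where the logical vector is TRUE.
na_positions <- which(my_na)
print(na_positions)    # 2 6

# Negating the logical vector selects the non-missing elements.
observed <- my_data[!my_na]
print(observed)        # -0.58 -0.11 0.84 -1.07
```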

...

|============================================ | 55%
| In our previous discussion of logical operators, we introduced the == operator as a
| method of testing for equality between two objects. So, you might think the expression
| my_data == NA yields the same results as is.na(). Give it a try.

my_data == NA
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[29] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[57] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[85] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

| All that hard work is paying off!

|================================================ | 60%
| The reason you got a vector of all NAs is that NA is not really a value, but just a
| placeholder for a quantity that is not available. Therefore the logical expression is
| incomplete and R has no choice but to return a vector of the same length as my_data
| that contains all NAs.
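
The same behaviour is easier to see on a tiny vector. In this sketch (the vector `v` and the name `eq_result` are illustrative), even comparing NA with NA gives NA, which is why `is.na()` is the right test for missingness.

```r
v <- c(1.5, NA, -0.3)

# Comparing anything with NA gives NA, even for the non-missing elements.
eq_result <- v == NA
print(eq_result)   # NA NA NA

# Two unknown values cannot be declared equal, so this is NA as well.
print(NA == NA)    # NA

# is.na() tests for missingness directly and always returns TRUE or FALSE.
print(is.na(v))    # FALSE TRUE FALSE
```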

...

|==================================================== | 65%
| Don't worry if that's a little confusing. The key takeaway is to be cautious when using
| logical expressions anytime NAs might creep in, since a single NA value can derail the
| entire thing.
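
To make "derail the entire thing" concrete, here is a small sketch that goes slightly beyond the lesson. It uses the `na.rm` argument of `all()` and `tryCatch()`, neither of which the lesson covers, and the names `v`, `above_one` and `result` are made up.

```r
v <- c(2, NA, 7)

# The comparison is NA wherever v is NA.
above_one <- v > 1
print(above_one)                     # TRUE NA TRUE

# One NA is enough to make the overall answer unknown.
print(all(above_one))                # NA

# na.rm = TRUE drops the NA before the check.
print(all(above_one, na.rm = TRUE))  # TRUE

# if() needs a definite TRUE or FALSE, so an NA condition raises an error.
result <- tryCatch(
  if (NA > 1) "big" else "small",
  error = function(e) "error: condition is NA"
)
print(result)                        # "error: condition is NA"
```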

...

|======================================================== | 70%
| So, back to the task at hand. Now that we have a vector, my_na, that has a TRUE for
| every NA and FALSE for every numeric value, we can compute the total number of NAs in
| our data.

...

|============================================================ | 75%
| The trick is to recognize that underneath the surface, R represents TRUE as the number
| 1 and FALSE as the number 0. Therefore, if we take the sum of a bunch of TRUEs and
| FALSEs, we get the total number of TRUEs.
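
A quick sketch of that coercion on a hand-made vector (`flags` is an illustrative name). The same trick with `mean()` gives the proportion of TRUEs, which the lesson does not use but which follows from the same rule.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Coercing to integer shows the underlying 1s and 0s.
print(as.integer(flags))  # 1 0 1 1 0

# Summing therefore counts the TRUEs.
print(sum(flags))         # 3

# Averaging gives the share of TRUEs.
print(mean(flags))        # 0.6
```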

...

|================================================================ | 80%
| Let's give that a try here. Call the sum() function on my_na to count the total number
| of TRUEs in my_na, and thus the total number of NAs in my_data. Don't assign the result
| to a new variable.

sum(my_na)
[1] 43

| You're the best!

|==================================================================== | 85%
| Pretty cool, huh? Finally, let's take a look at the data to convince ourselves that
| everything 'adds up'. Print my_data to the console.

my_data
[1] -0.578578797 NA -0.112639140 0.836412196 -1.074043937 NA
[7] NA 1.303020726 NA 1.514220057 -1.533126560 -0.366673361
[13] NA -1.032058614 NA -1.631213149 0.379297612 -0.706613051
[19] NA -0.692352920 NA NA 0.535394170 -1.872906664
[25] -0.861449272 -1.321735747 NA -0.787816086 NA 0.801388943
[31] NA -1.487792282 0.470028145 NA NA 1.187583726
[37] -1.704604005 NA 0.596807280 NA NA -1.493099149
[43] 0.265671235 NA -0.985396879 -0.974373033 NA 1.377397659
[49] -0.637308342 -1.450105656 0.192263390 0.776355028 NA NA
[55] NA NA NA 1.567478781 -0.511602362 0.107048330
[61] NA NA -0.408394399 0.592123817 NA 0.305403550
[67] 3.201114883 0.806735141 0.698544788 NA NA NA
[73] -2.671330480 0.123440813 NA NA NA -1.236370155
[79] 0.936670598 NA NA 1.519229128 NA -1.366810674
[85] 0.211749069 NA 0.203741812 1.319234085 NA -0.432928319
[91] -0.006566875 NA NA 0.060568295 0.292428312 NA
[97] NA 0.717821949 NA 0.359249723

| You're the best!
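
Rather than counting by eye, the check can be done in code. In the session above, 43 NAs out of 100 elements leaves 57 numeric values. The sketch below repeats that bookkeeping on a fresh sample, so its counts will differ from the session's; the seed and the names `n_missing` and `n_observed` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is repeatable

my_data <- sample(c(rnorm(1000), rep(NA, 1000)), 100)

n_missing <- sum(is.na(my_data))    # number of NAs
n_observed <- sum(!is.na(my_data))  # number of numeric values
print(c(missing = n_missing, observed = n_observed))

# Every element is either missing or observed, so the counts must add up.
print(n_missing + n_observed == length(my_data))  # TRUE
```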

|======================================================================== | 90%
| Now that we've got NAs down pat, let's look at a second type of missing value -- NaN,
| which stands for 'not a number'. To generate NaN, try dividing (using a forward slash)
| 0 by 0 now.

0/0
[1] NaN

| That's a job well done!

|============================================================================ | 95%
| Let's do one more, just for fun. In R, Inf stands for infinity. What happens if you
| subtract Inf from Inf?

Inf-Inf
[1] NaN

| You are really on a roll!
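
One detail the lesson does not spell out is how NA and NaN differ under the two test functions: `is.na()` is TRUE for both, while `is.nan()` is TRUE only for NaN. A short sketch (the vector `vals` is illustrative):

```r
# Undefined arithmetic gives NaN; well-defined infinite results stay Inf.
print(0 / 0)      # NaN
print(Inf - Inf)  # NaN
print(1 / 0)      # Inf
print(Inf + Inf)  # Inf

vals <- c(NA, NaN, 1)

# is.na() treats NaN as missing; is.nan() does not treat NA as NaN.
print(is.na(vals))   # TRUE TRUE FALSE
print(is.nan(vals))  # FALSE TRUE FALSE
```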

|================================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 1
What is your email address? xxxxxx@xxxxxxxxxxxx
What is your assignment token? xXxXxxXXxXxxXXXx
Grade submission succeeded!

| You are doing so well!

| You've reached the end of this lesson! Returning to the main
| menu...

| Please choose a course, or type 0 to exit swirl.

Last updated 2020-04-13 23:32:50.765370 IST
