Air Pollution data

Introduction

Data
The zip file containing the data can be downloaded here:

AirPollution.zip (2.67MB)

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:

Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

This data will be used in the following functions. The goal is load data from these files without dowloading or storing them in the hard disk drive, but be available to the functions to woek on them.

In [7]:
# Create an empty file in memory
file<-tempfile()
# Download the zip file to this empty file
download.file("http://link.datascience.eu.org/p004d1",file,mode="wb")
In [10]:
# Check data file format
head(unzip(file,list=TRUE))
Name Length Date
data/ 0 2012-09-22 00:27:00
data/001.csv 31271 2012-09-22 01:30:00
data/002.csv 81435 2012-09-22 01:30:00
data/003.csv 47185 2012-09-22 01:30:00
data/004.csv 78887 2012-09-22 01:30:00
data/005.csv 63204 2012-09-22 01:30:00

[^top]

Function 1: Mean, Median, Mode, Standard Deviation

The function "PollutantStats" will calculate the mean, median, mode and standard deviation of a pollutant (sulfate or nitrate) across a specified list of monitors.

Input
The function "PollutantStats" accepts two arguments:

  1. "pollutant" - Pollutant name sulfate or nitrate
  2. "id" - An optional Monitor id vector. Default value is 1:332.

Output
Mean, median, mode and standard deviation of the pollutant.

In [19]:
PollutantStats<-function(pollutant,id=1:332){
    # Initialize empty vectors.
    p<-c()    # Pollutant vector
    # Loop through each file number.
    for (i in id){
        # We need to have the file path in "data/001.csv" format to use read.csv() function.
        # Use sprintf to combine directory name and file number in required format.
        df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
        # Using a comma in df[,pollutant] converts it into a numeric vector.
        # Else it remains a dataframe and pollutant_vector becomes a list.
        p<-c(p,df[,pollutant])
    }
    # Calculate mean, median, mode and SD excluding missing values.
    Mean<-mean(p,na.rm=TRUE)
    Median<-mean(p,na.rm=TRUE)
    Mode<-names(table(p))[table(p)==max(table(p))]
    SD<-sd(p,na.rm=TRUE)
    data.frame(Pollutant=pollutant,Mean=Mean,Median=Median,Mode=Mode,Std_dev=SD)
}
In [21]:
PollutantStats("sulfate",1:12)
Pollutant Mean Median Mode Std_dev
sulfate 3.765641 3.765641 2.84 2.874151
In [22]:
PollutantStats("nitrate",74:80)
Pollutant Mean Median Mode Std_dev
nitrate 2.611021 2.611021 1.24 2.802195
In [24]:
PollutantStats("sulfate",10)
Pollutant Mean Median Mode Std_dev
sulfate 0.6243649 0.6243649 0.349 0.3637092
In [25]:
PollutantStats("nitrate")
Pollutant Mean Median Mode Std_dev
nitrate 1.702932 1.702932 1.1 2.52504

[^top]

Function 2: Data Quality

The function "DataQuality" will report the quality of data for each monitor ID.

Input
The function accepts one argument:

  1. "id" - An optional Monitor id vector. Default value is 1:332.

Output

  1. id: Monitor ID
  2. Obs: Total number of valid and invalid observations for the corresponding monitor ID (including missing values)
  3. Sulfate, Nitrate: Number of valid data points of each pollutant.
  4. SulfateQ, NitrateQ: Percentage of valid data points of each pollutant.
  5. PairWise: Number of valid data points of both pollutants together.
  6. PairWiseQ: Percentafe of valid data points of both pollutants together
In [86]:
DataQuality<-function(id=1:332){
    p<-c("sulfate","nitrate")
    # Create an empty data frame
    x=integer()
    q<-data.frame(id=x,Obs=x,Sulfate=x,Nitrate=x,PairWise=x)
    for (i in id){
        # Reading each csv file into a dataframe.
        df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
        obs<-nrow(df)
        sulfate<-nrow(df[!(is.na(df[p[1]])),])
        nitrate<-nrow(df[!(is.na(df[p[2]])),])
        pairwise<-nrow(df[!(is.na(df[p[1]])|is.na(df[p[2]])),])
        q[nrow(q)+1,]<-c(i,obs,sulfate,nitrate,pairwise)
    }
    # Add SulfateQ, NitrateQ and PairWiseQ columns.
    q$SulfateQ<-round(q$Sulfate/q$Obs*100,1)
    q$NitrateQ<-round(q$Nitrate/q$Obs*100,1)
    q$PairWiseQ<-round(q$PairWise/q$Obs*100,1)
    q
}
In [87]:
DataQuality(1)
id Obs Sulfate Nitrate PairWise SulfateQ NitrateQ PairWiseQ
1 1461 117 122 117 8 8.4 8
In [88]:
DataQuality(c(5,10,15,20,15))
id Obs Sulfate Nitrate PairWise SulfateQ NitrateQ PairWiseQ
5 2922 402 405 402 13.8 13.9 13.8
10 1096 148 183 148 13.5 16.7 13.5
15 730 83 85 83 11.4 11.6 11.4
20 1461 124 128 124 8.5 8.8 8.5
15 730 83 85 83 11.4 11.6 11.4
In [89]:
DataQuality(44:40)
id Obs Sulfate Nitrate PairWise SulfateQ NitrateQ PairWiseQ
44 2557 283 287 283 11.1 11.2 11.1
43 1095 74 77 74 6.8 7.0 6.8
42 1095 60 68 60 5.5 6.2 5.5
41 2556 227 243 227 8.9 9.5 8.9
40 365 21 21 21 5.8 5.8 5.8
In [90]:
dq<-DataQuality()
In [91]:
summary(dq$PairWiseQ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    9.90   12.30   12.98   15.32   41.10 

[^top]

Function 3: Correlation Test

The "Correlation" function will calculate the correlation coefficient and significance(correlation test) between the pollutants Sulfate and Nitrate for a range of minitor ids that meet required pair wise data quality.

Input

  1. pwq - Pair wise data quality in percentage. Default value is 0.
  2. id - An optional Monitor id vector. Default value is 1:332.

Output

  1. Pearson correlation test summary
In [128]:
Correlation<-function(pwq=0,id=1:332){
    p<-c("sulfate","nitrate")
    # Init empty data frame
    dfc<-as.data.frame(setNames(replicate(2,numeric(0), simplify = F), p))
    for (i in id){
        # Reading each csv file into a dataframe.
        df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
        # Pairwise complete obserations
        dfpw<-df[!(is.na(df[p[1]])|is.na(df[p[2]])),]
        # Checking if pairwise data quality of a monitor meets required pwq
        if (nrow(dfpw)/nrow(df)>=pwq/100){
            dfc<-rbind(dfc,dfpw[,p])
        }
    }
    # Pearson correlation test
    cor.test(dfc[,p[1]],y=dfc[,p[2]])
}
In [130]:
# Correlation test of monitors with pairwise data quality above the 35%
Correlation(35)
	Pearson's product-moment correlation

data:  dfc[, p[1]] and dfc[, p[2]]
t = -2.0727, df = 148, p-value = 0.03993
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.319614399 -0.007907202
sample estimates:
       cor 
-0.1679559 
In [127]:
# Correlation test of monitors with pairwise data quality above the mean
Correlation(12.98)
	Pearson's product-moment correlation

data:  dfc[, p[1]] and dfc[, p[2]]
t = 16.813, df = 73847, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05456375 0.06893346
sample estimates:
      cor 
0.0617518 
In [129]:
# Correlation test of all monitors
Correlation()
	Pearson's product-moment correlation

data:  dfc[, p[1]] and dfc[, p[2]]
t = 20.916, df = 111800, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05659269 0.06827041
sample estimates:
       cor 
0.06243369 
In [126]:
# Maximum pairwise data quality is 41.1%.
# Passing a value above that should give an eror.
Correlation(42)
Error in cor.test.default(dfc[, p[1]], y = dfc[, p[2]]): not enough finite observations
Traceback:

1. Correlation(42)
2. cor.test(dfc[, p[1]], y = dfc[, p[2]])   # at line 16 of file <text>
3. cor.test.default(dfc[, p[1]], y = dfc[, p[2]])
4. stop("not enough finite observations")

[^top]

Last updated 2020-04-17 22:16:01.288450 IST

Comments