Air Pollution data

Krishnakanth Allika

2020-04-17 18:02

Contents

Introduction
Function 1: Mean, Median, Mode, Standard Deviation
Function 2: Data Quality
Function 3: Correlation Test

Introduction ¶

Data
The zip file containing the data can be downloaded here:

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:

Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

This data will be used in the following functions. The goal is load data from these files without dowloading or storing them in the hard disk drive, but be available to the functions to woek on them.

In [7]:

# Create an empty file in memory
file<-tempfile()
# Download the zip file to this empty file
download.file("http://link.datascience.eu.org/p004d1",file,mode="wb")

In [10]:

# Check data file format
head(unzip(file,list=TRUE))

Name	Length	Date
data/	0	2012-09-22 00:27:00
data/001.csv	31271	2012-09-22 01:30:00
data/002.csv	81435	2012-09-22 01:30:00
data/003.csv	47185	2012-09-22 01:30:00
data/004.csv	78887	2012-09-22 01:30:00
data/005.csv	63204	2012-09-22 01:30:00

[^top]

Function 1: Mean, Median, Mode, Standard Deviation ¶

The function "PollutantStats" will calculate the mean, median, mode and standard deviation of a pollutant (sulfate or nitrate) across a specified list of monitors.

Input
The function "PollutantStats" accepts two arguments:

"pollutant" - Pollutant name sulfate or nitrate
"id" - An optional Monitor id vector. Default value is 1:332.

Output
Mean, median, mode and standard deviation of the pollutant.

In [19]:

PollutantStats<-function(pollutant,id=1:332){
    # Initialize empty vectors.
    p<-c()    # Pollutant vector
    # Loop through each file number.
    for (i in id){
        # We need to have the file path in "data/001.csv" format to use read.csv() function.
        # Use sprintf to combine directory name and file number in required format.
        df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
        # Using a comma in df[,pollutant] converts it into a numeric vector.
        # Else it remains a dataframe and pollutant_vector becomes a list.
        p<-c(p,df[,pollutant])
    }
    # Calculate mean, median, mode and SD excluding missing values.
    Mean<-mean(p,na.rm=TRUE)
    Median<-mean(p,na.rm=TRUE)
    Mode<-names(table(p))[table(p)==max(table(p))]
    SD<-sd(p,na.rm=TRUE)
    data.frame(Pollutant=pollutant,Mean=Mean,Median=Median,Mode=Mode,Std_dev=SD)
}

In [21]:

PollutantStats("sulfate",1:12)

Pollutant	Mean	Median	Mode	Std_dev
sulfate	3.765641	3.765641	2.84	2.874151

In [22]:

PollutantStats("nitrate",74:80)

Pollutant	Mean	Median	Mode	Std_dev
nitrate	2.611021	2.611021	1.24	2.802195

In [24]:

PollutantStats("sulfate",10)

Pollutant	Mean	Median	Mode	Std_dev
sulfate	0.6243649	0.6243649	0.349	0.3637092

In [25]:

PollutantStats("nitrate")

Pollutant	Mean	Median	Mode	Std_dev
nitrate	1.702932	1.702932	1.1	2.52504

[^top]

Function 2: Data Quality ¶

The function "DataQuality" will report the quality of data for each monitor ID.

Input
The function accepts one argument:

"id" - An optional Monitor id vector. Default value is 1:332.

Output

id: Monitor ID
Obs: Total number of valid and invalid observations for the corresponding monitor ID (including missing values)
Sulfate, Nitrate: Number of valid data points of each pollutant.
SulfateQ, NitrateQ: Percentage of valid data points of each pollutant.
PairWise: Number of valid data points of both pollutants together.
PairWiseQ: Percentafe of valid data points of both pollutants together

In [86]:

DataQuality<-function(id=1:332){
    p<-c("sulfate","nitrate")
    # Create an empty data frame
    x=integer()
    q<-data.frame(id=x,Obs=x,Sulfate=x,Nitrate=x,PairWise=x)
    for (i in id){
        # Reading each csv file into a dataframe.
        df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
        obs<-nrow(df)
        sulfate<-nrow(df[!(is.na(df[p[1]])),])
        nitrate<-nrow(df[!(is.na(df[p[2]])),])
        pairwise<-nrow(df[!(is.na(df[p[1]])|is.na(df[p[2]])),])
        q[nrow(q)+1,]<-c(i,obs,sulfate,nitrate,pairwise)
    }
    # Add SulfateQ, NitrateQ and PairWiseQ columns.
    q$SulfateQ<-round(q$Sulfate/q$Obs*100,1)
    q$NitrateQ<-round(q$Nitrate/q$Obs*100,1)
    q$PairWiseQ<-round(q$PairWise/q$Obs*100,1)
    q
}

In [87]:

DataQuality(1)

id	Obs	Sulfate	Nitrate	PairWise	SulfateQ	NitrateQ	PairWiseQ
1	1461	117	122	117	8	8.4	8

In [88]:

DataQuality(c(5,10,15,20,15))

id	Obs	Sulfate	Nitrate	PairWise	SulfateQ	NitrateQ	PairWiseQ
5	2922	402	405	402	13.8	13.9	13.8
10	1096	148	183	148	13.5	16.7	13.5
15	730	83	85	83	11.4	11.6	11.4
20	1461	124	128	124	8.5	8.8	8.5
15	730	83	85	83	11.4	11.6	11.4

In [89]:

DataQuality(44:40)

id	Obs	Sulfate	Nitrate	PairWise	SulfateQ	NitrateQ	PairWiseQ
44	2557	283	287	283	11.1	11.2	11.1
43	1095	74	77	74	6.8	7.0	6.8
42	1095	60	68	60	5.5	6.2	5.5
41	2556	227	243	227	8.9	9.5	8.9
40	365	21	21	21	5.8	5.8	5.8

In [90]:

dq<-DataQuality()

In [91]:

summary(dq$PairWiseQ)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    9.90   12.30   12.98   15.32   41.10

[^top]

Function 3: Correlation Test ¶

The "Correlation" function will calculate the correlation coefficient and significance(correlation test) between the pollutants Sulfate and Nitrate for a range of minitor ids that meet required pair wise data quality.

Input

pwq - Pair wise data quality in percentage. Default value is 0.
id - An optional Monitor id vector. Default value is 1:332.

Output

Pearson correlation test summary

In [128]:

Correlation<-function(pwq=0,id=1:332){
    p<-c("sulfate","nitrate")
    # Init empty data frame
    dfc<-as.data.frame(setNames(replicate(2,numeric(0), simplify = F), p))
    for (i in id){
        # Reading each csv file into a dataframe.
        df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
        # Pairwise complete obserations
        dfpw<-df[!(is.na(df[p[1]])|is.na(df[p[2]])),]
        # Checking if pairwise data quality of a monitor meets required pwq
        if (nrow(dfpw)/nrow(df)>=pwq/100){
            dfc<-rbind(dfc,dfpw[,p])
        }
    }
    # Pearson correlation test
    cor.test(dfc[,p[1]],y=dfc[,p[2]])
}

In [130]:

# Correlation test of monitors with pairwise data quality above the 35%
Correlation(35)

	Pearson's product-moment correlation

data:  dfc[, p[1]] and dfc[, p[2]]
t = -2.0727, df = 148, p-value = 0.03993
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.319614399 -0.007907202
sample estimates:
       cor 
-0.1679559

In [127]:

# Correlation test of monitors with pairwise data quality above the mean
Correlation(12.98)

	Pearson's product-moment correlation

data:  dfc[, p[1]] and dfc[, p[2]]
t = 16.813, df = 73847, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05456375 0.06893346
sample estimates:
      cor 
0.0617518

In [129]:

# Correlation test of all monitors
Correlation()

	Pearson's product-moment correlation

data:  dfc[, p[1]] and dfc[, p[2]]
t = 20.916, df = 111800, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05659269 0.06827041
sample estimates:
       cor 
0.06243369

In [126]:

# Maximum pairwise data quality is 41.1%.
# Passing a value above that should give an eror.
Correlation(42)

Error in cor.test.default(dfc[, p[1]], y = dfc[, p[2]]): not enough finite observations
Traceback:

1. Correlation(42)
2. cor.test(dfc[, p[1]], y = dfc[, p[2]])   # at line 16 of file <text>
3. cor.test.default(dfc[, p[1]], y = dfc[, p[2]])
4. stop("not enough finite observations")

[^top]

Last updated 2020-04-17 22:16:01.288450 IST

Introduction ¶

Function 1: Mean, Median, Mode, Standard Deviation ¶

Function 2: Data Quality ¶

Function 3: Correlation Test ¶

Comments