Air Pollution data
Introduction
Data
The zip file containing the data can be downloaded here:
AirPollution.zip (2.67MB)
The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:
Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)
These data are used by the functions below. The goal is to read the files directly from the zip archive without extracting them: the archive is downloaded once to a temporary file, and each function reads the CSV files it needs straight from that archive.
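# Reserve a temporary file path. tempfile() only returns a name in the
# session's temp directory; it does not create the file.
file<-tempfile()
# Download the zip archive to that path (mode="wb" keeps the binary content intact)
download.file("http://link.datascience.eu.org/p004d1",file,mode="wb")
# List the archive contents to check the file names and layout
head(unzip(file,list=TRUE))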
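The reading step that the functions below rely on can be sketched as follows. This is a minimal sketch, not part of the original code: the helper names `monitor_path` and `read_monitor` are hypothetical, and the `data/` directory inside the archive is assumed from the paths used by the functions in this document.

```r
# Hypothetical helper: build the path of a monitor's CSV inside the archive.
# sprintf("%03d") zero-pads the id to three digits, e.g. 7 becomes "007".
monitor_path <- function(id, dir = "data") {
  sprintf("%s/%03d.csv", dir, id)
}

# Hypothetical helper: read one monitor's data straight from the zip.
# unz() opens a connection to a single file inside the archive,
# so nothing is extracted to disk.
read_monitor <- function(zip_path, id, dir = "data") {
  read.csv(unz(zip_path, monitor_path(id, dir)))
}

# The path builder is vectorized over ids.
print(monitor_path(c(1, 45, 200)))

# Usage, assuming `file` holds the downloaded archive:
# head(read_monitor(file, 200))
```

Keeping the path construction in its own small function makes it easy to check the file naming against the listing returned by `unzip(file,list=TRUE)` before reading any data.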
Function 1: Mean, Median, Mode, Standard Deviation
The function "PollutantStats" will calculate the mean, median, mode and standard deviation of a pollutant (sulfate or nitrate) across a specified list of monitors.
Input
The function "PollutantStats" accepts two arguments:
- "pollutant" - Pollutant name, either "sulfate" or "nitrate".
- "id" - An optional Monitor id vector. Default value is 1:332.
Output
Mean, median, mode and standard deviation of the pollutant.
PollutantStats<-function(pollutant,id=1:332){
# Initialize an empty vector to collect the pollutant values.
p<-c()
# Loop through each monitor id.
for (i in id){
# read.csv() needs the path inside the zip in "data/001.csv" format.
# sprintf() zero-pads the monitor id to three digits.
df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
# The comma in df[,pollutant] returns a numeric vector.
# Without it the result stays a data frame and p becomes a list.
p<-c(p,df[,pollutant])
}
# Calculate mean, median, mode and SD excluding missing values.
Mean<-mean(p,na.rm=TRUE)
Median<-median(p,na.rm=TRUE)
# table() drops NA by default. Ties produce one output row per mode.
tab<-table(p)
Mode<-as.numeric(names(tab)[tab==max(tab)])
SD<-sd(p,na.rm=TRUE)
data.frame(Pollutant=pollutant,Mean=Mean,Median=Median,Mode=Mode,Std_dev=SD)
}
PollutantStats("sulfate",1:12)
PollutantStats("nitrate",74:80)
PollutantStats("sulfate",10)
PollutantStats("nitrate")
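R has no built-in function for the statistical mode, so PollutantStats derives it from a frequency table. The idea can be sketched on small made-up vectors (the numbers below are illustrative, not taken from the monitor data):

```r
# Toy pollutant readings with one missing value (made-up numbers).
p <- c(1.2, 3.4, 3.4, NA, 5.0, 1.2, 3.4)

# table() counts each distinct value and drops NA by default.
tab <- table(p)

# Keep the value(s) with the highest count. names() returns strings,
# so convert back to numeric.
modes <- as.numeric(names(tab)[tab == max(tab)])
print(modes)

# With ties, every tied value is returned.
p_tied <- c(1, 1, 2, 2, 3)
tab_tied <- table(p_tied)
modes_tied <- as.numeric(names(tab_tied)[tab_tied == max(tab_tied)])
print(modes_tied)
```

Because ties return more than one value, a result built with `data.frame()` from these statistics will contain one row per mode.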
Function 2: Data Quality
The function "DataQuality" will report the quality of data for each monitor ID.
Input
The function accepts one argument:
- "id" - An optional Monitor id vector. Default value is 1:332.
Output
- id: Monitor ID
- Obs: Total number of valid and invalid observations for the corresponding monitor ID (including missing values)
- Sulfate, Nitrate: Number of valid data points of each pollutant.
- SulfateQ, NitrateQ: Percentage of valid data points of each pollutant.
- PairWise: Number of valid data points of both pollutants together.
- PairWiseQ: Percentage of valid data points of both pollutants together.
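The output columns listed above can be illustrated on a single toy data frame. This is a minimal sketch with made-up values, and it uses `complete.cases()` for the pairwise count, which is one common way to do it rather than necessarily the approach used in the function itself:

```r
# Toy monitor data with missing values (made-up numbers).
df <- data.frame(
  Date    = as.Date("2020-01-01") + 0:4,
  sulfate = c(NA, 2.1, 3.3, NA, 4.0),
  nitrate = c(0.5, NA, 1.1, NA, 0.9)
)

obs        <- nrow(df)                    # all rows, valid or not
n_sulfate  <- sum(!is.na(df$sulfate))     # rows with a sulfate value
n_nitrate  <- sum(!is.na(df$nitrate))     # rows with a nitrate value
# Rows where both pollutants are present.
n_pairwise <- sum(complete.cases(df[, c("sulfate", "nitrate")]))

quality <- data.frame(
  Obs = obs, Sulfate = n_sulfate, Nitrate = n_nitrate, PairWise = n_pairwise,
  SulfateQ  = round(n_sulfate / obs * 100, 1),
  NitrateQ  = round(n_nitrate / obs * 100, 1),
  PairWiseQ = round(n_pairwise / obs * 100, 1)
)
print(quality)
```

The pairwise count can never exceed the smaller of the two single-pollutant counts, which is a quick sanity check on the full table.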
DataQuality<-function(id=1:332){
p<-c("sulfate","nitrate")
# Create an empty data frame
x=integer()
q<-data.frame(id=x,Obs=x,Sulfate=x,Nitrate=x,PairWise=x)
for (i in id){
# Reading each csv file into a dataframe.
df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
obs<-nrow(df)
# Count the non-missing values of each pollutant.
sulfate<-sum(!is.na(df[[p[1]]]))
nitrate<-sum(!is.na(df[[p[2]]]))
# Count the rows where both pollutants are present.
pairwise<-sum(complete.cases(df[,p]))
q[nrow(q)+1,]<-c(i,obs,sulfate,nitrate,pairwise)
}
# Add SulfateQ, NitrateQ and PairWiseQ columns.
q$SulfateQ<-round(q$Sulfate/q$Obs*100,1)
q$NitrateQ<-round(q$Nitrate/q$Obs*100,1)
q$PairWiseQ<-round(q$PairWise/q$Obs*100,1)
q
}
DataQuality(1)
DataQuality(c(5,10,15,20,15))
DataQuality(44:40)
dq<-DataQuality()
summary(dq$PairWiseQ)
Function 3: Correlation Test
The "Correlation" function runs a Pearson correlation test, reporting the correlation coefficient and its significance, between the pollutants sulfate and nitrate for the monitor ids that meet the required pairwise data quality.
Input
- pwq - Minimum pairwise data quality, in percent. Default value is 0.
- id - An optional Monitor id vector. Default value is 1:332.
Output
- Pearson correlation test summary
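The filter-then-test idea can be sketched without the archive. In this minimal sketch the two "monitors" are made-up data frames held in a list, and the variable names (`monitors`, `kept`, `result`) are illustrative only:

```r
p <- c("sulfate", "nitrate")

# Two toy monitors with made-up readings.
monitors <- list(
  m1 = data.frame(sulfate = c(1, 2, NA, 4, 5),
                  nitrate = c(2.1, 3.9, 6.2, NA, 10.3)),
  m2 = data.frame(sulfate = c(NA, NA, NA, 3),
                  nitrate = c(1, NA, NA, 2))
)

pwq  <- 50          # required pairwise data quality, in percent
kept <- data.frame(sulfate = numeric(0), nitrate = numeric(0))

for (df in monitors) {
  # Rows where both pollutants are present.
  dfpw <- df[complete.cases(df[, p]), p]
  # m1 has 3 of 5 rows complete (60%), m2 has 1 of 4 (25%).
  if (nrow(dfpw) / nrow(df) >= pwq / 100) {
    kept <- rbind(kept, dfpw)
  }
}

# Pearson correlation test on the pooled, complete rows.
result <- cor.test(kept$sulfate, kept$nitrate)
print(result$estimate)
print(result$p.value)
```

Note that the observations of all qualifying monitors are pooled before testing, so the result is a single correlation across monitors rather than one per monitor.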
Correlation<-function(pwq=0,id=1:332){
p<-c("sulfate","nitrate")
# Init empty data frame
dfc<-as.data.frame(setNames(replicate(2,numeric(0),simplify=FALSE),p))
for (i in id){
# Reading each csv file into a dataframe.
df<-read.csv(unz(file,sprintf("data/%03d.csv",i)))
# Pairwise complete observations
dfpw<-df[!(is.na(df[p[1]])|is.na(df[p[2]])),]
# Checking if pairwise data quality of a monitor meets required pwq
if (nrow(dfpw)/nrow(df)>=pwq/100){
dfc<-rbind(dfc,dfpw[,p])
}
}
# Pearson correlation test
cor.test(dfc[,p[1]],y=dfc[,p[2]])
}
# Correlation test of monitors with pairwise data quality of at least 35%
Correlation(35)
# Correlation test of monitors with pairwise data quality at or above the mean (12.98%)
Correlation(12.98)
# Correlation test of all monitors
Correlation()
# Maximum pairwise data quality is 41.1%.
# Passing a value above that should give an error,
# because no monitor qualifies and cor.test() receives no observations.
Correlation(42)
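When a script should keep running past this case, the error can be caught instead of stopping execution. A minimal sketch, using empty vectors to stand in for the pooled data when no monitor meets the threshold:

```r
# Empty inputs mimic the case where no monitor meets the required quality.
x <- numeric(0)
y <- numeric(0)

# tryCatch() returns the error message as a string instead of aborting.
result <- tryCatch(
  cor.test(x, y),
  error = function(e) conditionMessage(e)
)
print(result)
```

The same wrapper can be placed around a call such as `Correlation(42)` to report the problem and continue.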
Last updated 2020-04-17 22:16:01.288450 IST