COVID-19 data

JHU exercise data

Disclaimer: This exercise is to help people understand data sclicing operations in R. I will not be providing solutions to JHU assignments. However, these exercises will help get closer to finding solutions.

Data is provided in the form of a URL containing a zip file. There are more than one ways to extract data from zip files.

Data extraction - Method 1

  1. Download zip file into working directory.
  2. Decompress the zip file into working directory.
  3. Check data file format.
  4. Load decompressed file into a dataframe.

Disadvantage: Will end up with two files (a zip file and a data file) in the working directory which are unnecessary.

In [1]:
# Download the dataset from the URL to the working directory and name it as "data.zip"
download.file("https://d396qusza40orc.cloudfront.net/rprog/data/quiz1_data.zip","data.zip")
#unzip the zip file to a directory called "data"
unzip("data.zip",exdir="data")
# List file in the "data" directory to check data file format
list.files("data")
'hw1_data.csv'

The data is available in a CSV (comma seperated values) file. We will use read.csv to read data into a dataframe.

In [2]:
# Reading data into a dataframe called "df"
df_data1<-read.csv("data/hw1_data.csv")
# View sample data.
head(df_data1)
Ozone Solar.R Wind Temp Month Day
41 190 7.4 67 5 1
36 118 8.0 72 5 2
12 149 12.6 74 5 3
18 313 11.5 62 5 4
NA NA 14.3 56 5 5
28 NA 14.9 66 5 6

Data extraction - Method 2

Earlier, we ended up with two unnecessary files and a directory in our working directory. Let's try another way to extracting data without downloading anything to our working directory.

  1. Download zip file into kernel/memory.
  2. Check data file format.
  3. Load data file into a dataframe.
In [3]:
# Create an empty file in memory
file<-tempfile()
# Download the zip file to this empty file
download.file("https://d396qusza40orc.cloudfront.net/rprog/data/quiz1_data.zip",file)
# Check data file format
unzip(file,list=TRUE)
Name Length Date
hw1_data.csv 2902 2012-12-26 14:25:00
In [4]:
# Load data file to dataframe
df_data<-read.csv(unz(file,"hw1_data.csv"))
# Delete file from memory
unlink(file)
# View sample data
head(df_data)
Ozone Solar.R Wind Temp Month Day
41 190 7.4 67 5 1
36 118 8.0 72 5 2
12 149 12.6 74 5 3
18 313 11.5 62 5 4
NA NA 14.3 56 5 5
28 NA 14.9 66 5 6

Checking if both dataframes are identical

In [5]:
identical(df_data,df_data1)
TRUE
In [6]:
# Dimensions of the dataframe
dim(df_data)
  1. 153
  2. 6

df_data has 153 rows and 6 columns.
Let's write the same in data science langauge:
The dataframe contains 153 records of 6 variables or features.

In [7]:
# View the first 3 records of te dataframe to get an idea about the data in it.
head(df_data, n=3)
Ozone Solar.R Wind Temp Month Day
41 190 7.4 67 5 1
36 118 8.0 72 5 2
12 149 12.6 74 5 3
In [8]:
# View the last four records.
tail(df_data,n=4)
Ozone Solar.R Wind Temp Month Day
150 NA 145 13.2 77 9 27
151 14 191 14.3 75 9 28
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30

We can also find the number of records by finding the length of any one of the variables (example, Ozone)

In [9]:
length(df_data$Ozone)
153

Data slicing (also called subsetting in R) is done using the [ function. The standard syntax is [rows,columns]. If we need to slice by rows, we can omit columns like [rows,]. Do not forget the trailing comma because without it, R assumes that we are slicing by columns by default or slicing a vector.

In [15]:
# Ozone level of 29th row
df_data[29,"Ozone"]
45

If the variables of a dataframe are named, then we can also use their names the following way to achieve the same result as above.

In [17]:
# Ozone level of 29th row
df_data$Ozone[29]
45

How did this work? When we call a variable of a dataframe, it becomes a vector. Here df_data$Ozone is an integer vector. We viewed the 29th element of the vector.

In [18]:
class(df_data$Ozone)
'integer'

Missing values: is.na() provides a logical vector of missing values - TRUE if NA and FALSE if not NA. In R, TRUE = 1 and FALSE = 0. A sum function will output the sum of all TRUE values (or missing values).

Number of missing values in variable Solar.R:

In [21]:
# Number of missing values in variable Solar.R
sum(is.na(df_data$Solar.R))
7

Mean of the variable Solar.R excluding missing values:

In [23]:
# Mean of the variable Solar.R excluding missing values
mean(df_data$Solar.R, na.rm=TRUE)
185.931506849315

Mean of Solar.R where Ozone levels are lower than 25 and Temp is below 88:

In [25]:
# Mean of Solar.R where Ozone levels are lower than 25 and Temp is below 88.
mean(df_data$Solar.R[df_data$Ozone<25&df_data$Temp<88],na.rm=TRUE)
145.163265306122

Average Ozone level in the month of May:

In [27]:
# Average Ozone level in the month of May
mean(df_data$Ozone[df_data$Month==5], na.rm=TRUE)
23.6153846153846

Maximum Solar Radiation in June:

In [29]:
# Maximum Solar Radiation in June
# More info on usage of max() function: http://www.endmemo.com/program/R/max.php
max(df_data$Solar.R[df_data$Month==6],na.rm=TRUE)
332

[^top]

COVID data by JHU CSSE

Source: https://github.com/CSSEGISandData/COVID-19
Data URL: http://link.datascience.eu.org/p003d1
Data last updated: 2020-04-13

In [33]:
# Loading data to dataframe
df_JHU<-read.csv("http://link.datascience.eu.org/p003d1")
head(df_JHU, n=2)
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
45001 Abbeville South Carolina US 2020-04-13 23:07:54 34.22333 -82.46171 9 0 0 9 Abbeville, South Carolina, US
22001 Acadia Louisiana US 2020-04-13 23:07:54 30.29506 -92.41420 101 5 0 96 Acadia, Louisiana, US

Pruning dataframe by stripping away unrequired variables:
We only need Country, State, Confirmed, Deaths, Recovered and Active.

In [34]:
df_JHU<-df_JHU[,c("Country_Region","Province_State","Confirmed","Deaths","Recovered","Active")]
head(df_JHU,n=2)
Country_Region Province_State Confirmed Deaths Recovered Active
US South Carolina 9 0 0 9
US Louisiana 101 5 0 96

Confirmed cases, deaths and death rates in US and India vs Global:

In [39]:
# Confirmed cases
c_US<-sum(df_JHU[df_JHU$Country_Region=="US","Confirmed"])
c_IN<-sum(df_JHU[df_JHU$Country_Region=="India","Confirmed"])
c_GL<-sum(df_JHU["Confirmed"])
c<-c(c_US,c_IN,c_GL)
# Deaths
d_US<-sum(df_JHU[df_JHU$Country_Region=="US","Deaths"])
d_IN<-sum(df_JHU[df_JHU$Country_Region=="India","Deaths"])
d_GL<-sum(df_JHU["Deaths"])
d<-c(d_US,d_IN,d_GL)
# Death rates
dr<-d/c*100
# Creating a dataframe
df_dr<-cbind(Country=c("US","India","Global"),Confirmed=c,Deaths=d,Death_Rate=dr)
df_dr
Country Confirmed Deaths Death_Rate
US 580619 23529 4.0523992497662
India 10453 358 3.42485410886827
Global 1917320 119483 6.23177143095571

The numbers are close to the reports from https://www.worldometers.info/coronavirus/
Worldometer report

[^top]

In [40]:
df_kaggle<-read.csv("http://link.datascience.eu.org/p003d2")
head(df_kaggle,n=2)
SNo ObservationDate Province.State Country.Region Last.Update Confirmed Deaths Recovered
1 01/22/2020 Anhui Mainland China 1/22/2020 17:00 1 0 0
2 01/22/2020 Beijing Mainland China 1/22/2020 17:00 14 0 0

Pruning dataframe by stripping away unrequired variables:
We only need Country, State, Confirmed, Deaths, Recovered.

In [41]:
df_kaggle<-df_kaggle[,c("Country.Region","Province.State","Confirmed","Deaths","Recovered")]
head(df_kaggle,n=2)
Country.Region Province.State Confirmed Deaths Recovered
Mainland China Anhui 1 0 0
Mainland China Beijing 14 0 0

Confirmed cases, deaths and death rates in US and India vs Global:

In [42]:
# Confirmed cases
c_US<-sum(df_kaggle[df_kaggle$Country.Region=="US","Confirmed"])
c_IN<-sum(df_kaggle[df_kaggle$Country.Region=="India","Confirmed"])
c_GL<-sum(df_kaggle["Confirmed"])
c<-c(c_US,c_IN,c_GL)
# Deaths
d_US<-sum(df_kaggle[df_kaggle$Country.Region=="US","Deaths"])
d_IN<-sum(df_kaggle[df_kaggle$Country.Region=="India","Deaths"])
d_GL<-sum(df_kaggle["Deaths"])
d<-c(d_US,d_IN,d_GL)
# Death rates
dr<-d/c*100
# Creating a dataframe
df_dr<-cbind(Country=c("US","India","Global"),Confirmed=c,Deaths=d,Death_Rate=dr)
df_dr
Country Confirmed Deaths Death_Rate
US 6278333 193275 3.07844454889538
India 82548 2526 3.06003779619131
Global 29221460 1511343 5.1720311031687

The numbers are way off compared to the reports from https://www.worldometers.info/coronavirus/. The source data needs to be checked.
Worldometer report

[^top]

COVID by EU Open Data Portal

Source: https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
Data URL: http://link.datascience.eu.org/p003d3
Data last updated: 2020-04-14

In [43]:
df_EU<-read.csv("http://link.datascience.eu.org/p003d3")
head(df_EU,n=2)
dateRep day month year cases deaths countriesAndTerritories geoId countryterritoryCode popData2018
14/04/2020 14 4 2020 58 3 Afghanistan AF AFG 37172386
13/04/2020 13 4 2020 52 0 Afghanistan AF AFG 37172386

Pruning dataframe by stripping away unrequired variables:
We only need Country, cases, deaths, population.

In [44]:
df_EU<-df_EU[,c("countriesAndTerritories","cases","deaths","popData2018")]
head(df_EU,n=2)
countriesAndTerritories cases deaths popData2018
Afghanistan 58 3 37172386
Afghanistan 52 0 37172386

Confirmed cases, deaths, death rates and cases per million population in US and India vs Global:

In [49]:
# cases cases
c_US<-sum(df_EU[df_EU$countriesAndTerritories=="United_States_of_America","cases"])
c_IN<-sum(df_EU[df_EU$countriesAndTerritories=="India","cases"])
c_GL<-sum(df_EU["cases"])
c<-c(c_US,c_IN,c_GL)
# deaths
d_US<-sum(df_EU[df_EU$countriesAndTerritories=="United_States_of_America","deaths"])
d_IN<-sum(df_EU[df_EU$countriesAndTerritories=="India","deaths"])
d_GL<-sum(df_EU["deaths"])
d<-c(d_US,d_IN,d_GL)
# Death rates
dr<-d/c*100
# Population
p_US<-df_EU[df_EU$countriesAndTerritories=="United_States_of_America","popData2018"][1]
p_IN<-df_EU[df_EU$countriesAndTerritories=="India","popData2018"][1]
p_GL<-7631091112     # World pop in 2018 per https://www.populationpyramid.net/world/2018/
p<-c(p_US,p_IN,p_GL)
# Cases per million population
cpm<-c/p*1000000
# Creating a dataframe
df_dr<-cbind(Country=c("United_States_of_America","India","Global"),cases=c,deaths=d,Death_Rate=dr, Cases_per_Mil=cpm)
df_dr
Country cases deaths Death_Rate Cases_per_Mil
United_States_of_America 582594 23649 4.05925910668493 1780.72124378981
India 10363 339 3.27125349802181 7.66144258651698
Global 1873265 118854 6.34475100960088 245.478002097795

The numbers are close to the reports from https://www.worldometers.info/coronavirus/
Worldometer report

[^top]

Last updated 2020-04-15 11:52:46.054857 IST

Comments