Contents

JHU exercise data
COVID data by JHU CSSE
COVID data by Kaggle
COVID by EU Open Data Portal

JHU exercise data ¶

Disclaimer: This exercise is to help people understand data sclicing operations in R. I will not be providing solutions to JHU assignments. However, these exercises will help get closer to finding solutions.

Data is provided in the form of a URL containing a zip file. There are more than one ways to extract data from zip files.

Data extraction - Method 1¶

Download zip file into working directory.
Decompress the zip file into working directory.
Check data file format.
Load decompressed file into a dataframe.

Disadvantage: Will end up with two files (a zip file and a data file) in the working directory which are unnecessary.

In [1]:

# Download the dataset from the URL to the working directory and name it as "data.zip"
download.file("https://d396qusza40orc.cloudfront.net/rprog/data/quiz1_data.zip","data.zip")
#unzip the zip file to a directory called "data"
unzip("data.zip",exdir="data")
# List file in the "data" directory to check data file format
list.files("data")

'hw1_data.csv'

The data is available in a CSV (comma seperated values) file. We will use read.csv to read data into a dataframe.

In [2]:

# Reading data into a dataframe called "df"
df_data1<-read.csv("data/hw1_data.csv")
# View sample data.
head(df_data1)

Ozone	Solar.R	Wind	Temp	Month	Day
41	190	7.4	67	5	1
36	118	8.0	72	5	2
12	149	12.6	74	5	3
18	313	11.5	62	5	4
NA	NA	14.3	56	5	5
28	NA	14.9	66	5	6

Data extraction - Method 2¶

Earlier, we ended up with two unnecessary files and a directory in our working directory. Let's try another way to extracting data without downloading anything to our working directory.

Download zip file into kernel/memory.
Check data file format.
Load data file into a dataframe.

In [3]:

# Create an empty file in memory
file<-tempfile()
# Download the zip file to this empty file
download.file("https://d396qusza40orc.cloudfront.net/rprog/data/quiz1_data.zip",file)
# Check data file format
unzip(file,list=TRUE)

Name	Length	Date
hw1_data.csv	2902	2012-12-26 14:25:00

In [4]:

# Load data file to dataframe
df_data<-read.csv(unz(file,"hw1_data.csv"))
# Delete file from memory
unlink(file)
# View sample data
head(df_data)

Ozone	Solar.R	Wind	Temp	Month	Day
41	190	7.4	67	5	1
36	118	8.0	72	5	2
12	149	12.6	74	5	3
18	313	11.5	62	5	4
NA	NA	14.3	56	5	5
28	NA	14.9	66	5	6

Checking if both dataframes are identical

In [5]:

identical(df_data,df_data1)

TRUE

In [6]:

# Dimensions of the dataframe
dim(df_data)

153
6

df_data has 153 rows and 6 columns.
Let's write the same in data science langauge:
The dataframe contains 153 records of 6 variables or features.

In [7]:

# View the first 3 records of te dataframe to get an idea about the data in it.
head(df_data, n=3)

Ozone	Solar.R	Wind	Temp	Month	Day
41	190	7.4	67	5	1
36	118	8.0	72	5	2
12	149	12.6	74	5	3

In [8]:

# View the last four records.
tail(df_data,n=4)

	Ozone	Solar.R	Wind	Temp	Month	Day
150	NA	145	13.2	77	9	27
151	14	191	14.3	75	9	28
152	18	131	8.0	76	9	29
153	20	223	11.5	68	9	30

We can also find the number of records by finding the length of any one of the variables (example, Ozone)

In [9]:

length(df_data$Ozone)

153

Data slicing (also called subsetting in R) is done using the [ function. The standard syntax is [rows,columns]. If we need to slice by rows, we can omit columns like [rows,]. Do not forget the trailing comma because without it, R assumes that we are slicing by columns by default or slicing a vector.

In [15]:

# Ozone level of 29th row
df_data[29,"Ozone"]

45

If the variables of a dataframe are named, then we can also use their names the following way to achieve the same result as above.

In [17]:

# Ozone level of 29th row
df_data$Ozone[29]

45

How did this work? When we call a variable of a dataframe, it becomes a vector. Here df_data$Ozone is an integer vector. We viewed the 29th element of the vector.

In [18]:

class(df_data$Ozone)

'integer'

Missing values: is.na() provides a logical vector of missing values - TRUE if NA and FALSE if not NA. In R, TRUE = 1 and FALSE = 0. A sum function will output the sum of all TRUE values (or missing values).

Number of missing values in variable Solar.R:

In [21]:

# Number of missing values in variable Solar.R
sum(is.na(df_data$Solar.R))

7

Mean of the variable Solar.R excluding missing values:

In [23]:

# Mean of the variable Solar.R excluding missing values
mean(df_data$Solar.R, na.rm=TRUE)

185.931506849315

Mean of Solar.R where Ozone levels are lower than 25 and Temp is below 88:

In [25]:

# Mean of Solar.R where Ozone levels are lower than 25 and Temp is below 88.
mean(df_data$Solar.R[df_data$Ozone<25&df_data$Temp<88],na.rm=TRUE)

145.163265306122

Average Ozone level in the month of May:

In [27]:

# Average Ozone level in the month of May
mean(df_data$Ozone[df_data$Month==5], na.rm=TRUE)

23.6153846153846

Maximum Solar Radiation in June:

In [29]:

# Maximum Solar Radiation in June
# More info on usage of max() function: http://www.endmemo.com/program/R/max.php
max(df_data$Solar.R[df_data$Month==6],na.rm=TRUE)

332

[^top]

COVID data by JHU CSSE ¶

Source: https://github.com/CSSEGISandData/COVID-19
Data URL: http://link.datascience.eu.org/p003d1
Data last updated: 2020-04-13

In [33]:

# Loading data to dataframe
df_JHU<-read.csv("http://link.datascience.eu.org/p003d1")
head(df_JHU, n=2)

FIPS	Admin2	Province_State	Country_Region	Last_Update	Lat	Long_	Confirmed	Deaths	Recovered	Active	Combined_Key
45001	Abbeville	South Carolina	US	2020-04-13 23:07:54	34.22333	-82.46171	9	0	0	9	Abbeville, South Carolina, US
22001	Acadia	Louisiana	US	2020-04-13 23:07:54	30.29506	-92.41420	101	5	0	96	Acadia, Louisiana, US

Pruning dataframe by stripping away unrequired variables:
We only need Country, State, Confirmed, Deaths, Recovered and Active.

In [34]:

df_JHU<-df_JHU[,c("Country_Region","Province_State","Confirmed","Deaths","Recovered","Active")]
head(df_JHU,n=2)

Country_Region	Province_State	Confirmed	Deaths	Recovered	Active
US	South Carolina	9	0	0	9
US	Louisiana	101	5	0	96

Confirmed cases, deaths and death rates in US and India vs Global:

In [39]:

# Confirmed cases
c_US<-sum(df_JHU[df_JHU$Country_Region=="US","Confirmed"])
c_IN<-sum(df_JHU[df_JHU$Country_Region=="India","Confirmed"])
c_GL<-sum(df_JHU["Confirmed"])
c<-c(c_US,c_IN,c_GL)
# Deaths
d_US<-sum(df_JHU[df_JHU$Country_Region=="US","Deaths"])
d_IN<-sum(df_JHU[df_JHU$Country_Region=="India","Deaths"])
d_GL<-sum(df_JHU["Deaths"])
d<-c(d_US,d_IN,d_GL)
# Death rates
dr<-d/c*100
# Creating a dataframe
df_dr<-cbind(Country=c("US","India","Global"),Confirmed=c,Deaths=d,Death_Rate=dr)
df_dr

Country	Confirmed	Deaths	Death_Rate
US	580619	23529	4.0523992497662
India	10453	358	3.42485410886827
Global	1917320	119483	6.23177143095571

The numbers are close to the reports from https://www.worldometers.info/coronavirus/
Worldometer report

[^top]

COVID data by Kaggle ¶

Source: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/data
Data URL: http://link.datascience.eu.org/p003d2
Data last updated: 2020-04-13

In [40]:

df_kaggle<-read.csv("http://link.datascience.eu.org/p003d2")
head(df_kaggle,n=2)

SNo	ObservationDate	Province.State	Country.Region	Last.Update	Confirmed	Deaths	Recovered
1	01/22/2020	Anhui	Mainland China	1/22/2020 17:00	1	0	0
2	01/22/2020	Beijing	Mainland China	1/22/2020 17:00	14	0	0

Pruning dataframe by stripping away unrequired variables:
We only need Country, State, Confirmed, Deaths, Recovered.

In [41]:

df_kaggle<-df_kaggle[,c("Country.Region","Province.State","Confirmed","Deaths","Recovered")]
head(df_kaggle,n=2)

Country.Region	Province.State	Confirmed	Deaths	Recovered
Mainland China	Anhui	1	0	0
Mainland China	Beijing	14	0	0

Confirmed cases, deaths and death rates in US and India vs Global:

In [42]:

# Confirmed cases
c_US<-sum(df_kaggle[df_kaggle$Country.Region=="US","Confirmed"])
c_IN<-sum(df_kaggle[df_kaggle$Country.Region=="India","Confirmed"])
c_GL<-sum(df_kaggle["Confirmed"])
c<-c(c_US,c_IN,c_GL)
# Deaths
d_US<-sum(df_kaggle[df_kaggle$Country.Region=="US","Deaths"])
d_IN<-sum(df_kaggle[df_kaggle$Country.Region=="India","Deaths"])
d_GL<-sum(df_kaggle["Deaths"])
d<-c(d_US,d_IN,d_GL)
# Death rates
dr<-d/c*100
# Creating a dataframe
df_dr<-cbind(Country=c("US","India","Global"),Confirmed=c,Deaths=d,Death_Rate=dr)
df_dr

Country	Confirmed	Deaths	Death_Rate
US	6278333	193275	3.07844454889538
India	82548	2526	3.06003779619131
Global	29221460	1511343	5.1720311031687

The numbers are way off compared to the reports from https://www.worldometers.info/coronavirus/. The source data needs to be checked.
Worldometer report

[^top]

COVID by EU Open Data Portal ¶

Source: https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
Data URL: http://link.datascience.eu.org/p003d3
Data last updated: 2020-04-14

In [43]:

df_EU<-read.csv("http://link.datascience.eu.org/p003d3")
head(df_EU,n=2)

dateRep	day	month	year	cases	deaths	countriesAndTerritories	geoId	countryterritoryCode	popData2018
14/04/2020	14	4	2020	58	3	Afghanistan	AF	AFG	37172386
13/04/2020	13	4	2020	52	0	Afghanistan	AF	AFG	37172386

Pruning dataframe by stripping away unrequired variables:
We only need Country, cases, deaths, population.

In [44]:

df_EU<-df_EU[,c("countriesAndTerritories","cases","deaths","popData2018")]
head(df_EU,n=2)

countriesAndTerritories	cases	deaths	popData2018
Afghanistan	58	3	37172386
Afghanistan	52	0	37172386

Confirmed cases, deaths, death rates and cases per million population in US and India vs Global:

In [49]:

# cases cases
c_US<-sum(df_EU[df_EU$countriesAndTerritories=="United_States_of_America","cases"])
c_IN<-sum(df_EU[df_EU$countriesAndTerritories=="India","cases"])
c_GL<-sum(df_EU["cases"])
c<-c(c_US,c_IN,c_GL)
# deaths
d_US<-sum(df_EU[df_EU$countriesAndTerritories=="United_States_of_America","deaths"])
d_IN<-sum(df_EU[df_EU$countriesAndTerritories=="India","deaths"])
d_GL<-sum(df_EU["deaths"])
d<-c(d_US,d_IN,d_GL)
# Death rates
dr<-d/c*100
# Population
p_US<-df_EU[df_EU$countriesAndTerritories=="United_States_of_America","popData2018"][1]
p_IN<-df_EU[df_EU$countriesAndTerritories=="India","popData2018"][1]
p_GL<-7631091112     # World pop in 2018 per https://www.populationpyramid.net/world/2018/
p<-c(p_US,p_IN,p_GL)
# Cases per million population
cpm<-c/p*1000000
# Creating a dataframe
df_dr<-cbind(Country=c("United_States_of_America","India","Global"),cases=c,deaths=d,Death_Rate=dr, Cases_per_Mil=cpm)
df_dr

Country	cases	deaths	Death_Rate	Cases_per_Mil
United_States_of_America	582594	23649	4.05925910668493	1780.72124378981
India	10363	339	3.27125349802181	7.66144258651698
Global	1873265	118854	6.34475100960088	245.478002097795

The numbers are close to the reports from https://www.worldometers.info/coronavirus/
Worldometer report

[^top]

Last updated 2020-04-15 11:52:46.054857 IST

JHU exercise data ¶

Data extraction - Method 1¶

Data extraction - Method 2¶

COVID data by JHU CSSE ¶

COVID data by Kaggle ¶

COVID by EU Open Data Portal ¶

Comments