14  Importing Data - Part 1

Today we will focus on the practice of importing data, in more depth than last time.

Our framework for the workflow of data visualization is shown in Figure 14.1

Figure 14.1: Tidyverse framework again

Acquiring and importing data is the most complicated part of this course and of data visualization in general. This unit comes now, rather than at the beginning, because of its difficulty: it is painful work that provides little of the immediate satisfaction of a cool map or graphic. In my experience, data import and manipulation is 80+% of the work when creating visualizations, so it needs to be covered at least nominally in any course on data visualization.

14.1 Load and Install Packages

As always, we should load the packages we need to import the data. There are many specialized data import packages, but tidyverse and sf are a good start and can handle many standard tables and geospatial data files. Remember, you can check that a package is loaded in your R session by going to the Files, Plots, and Packages panel, clicking on the Packages tab, and scrolling down to tidyverse and sf to make sure they are checked.
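The attach messages below come from loading the two packages. If tidyverse and sf are not checked in your Packages tab, run the library() calls yourself:

```r
# Load the two packages used throughout this chapter
library(tidyverse)  # readr, dplyr, ggplot2, and friends
library(sf)         # simple features for geospatial data
```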

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

14.2 Option 1. Point and Click Download, File Save, Read

The basic way to acquire data is the Point and Click method. Here are step-by-step instructions for doing that.

14.2.1 Find and Download Data

Go to CalEnviroScreen

Download the Zipped Shapefile shown in the screenshot in Figure 14.2

Figure 14.2: CalEnviroScreen Shapefile Location

By default, downloads are often placed in a Downloads directory, although you may have changed that on your local machine.

You can skip the next step if you directly save the zip file to your working directory.

14.2.2 Move the Zipped Shapefile to the R Working Directory

If your download went to a default Downloads directory, the zipped file needs to be either (a) moved to the R working directory or (b) read from where it is, by identifying the filepath of the default download directory and working with it from there.

For today, I will only show path (a) because it is good data science practice to keep the data in a directory associated with the visualization.

  1. Identify the directory where the zipped shapefile was downloaded. On my machine, this is a Downloads folder, which can be accessed through my web browser after the file download is complete; Figure 14.3 shows an example. The name of the file is calenviroscreen40shpf2021shp.zip.

Figure 14.3: Browser download

  2. Identify the R working directory on your machine using the getwd() function.

getwd()
[1] "C:/Dev/EnviroDataVis"

  3. Move calenviroscreen40shpf2021shp.zip from the default download directory to the R working directory. Either drag it, copy and paste it, or cut and paste it. On Macs, use the Finder tool; on PCs, use File Explorer.

  4. Check your Files, Plots, and Packages panel to see that the zipped file is identified by RStudio. See the example in Figure 14.4.

Figure 14.4: Files, Plots, and Packages Panel

If you see the calenviroscreen40shpf2021shp.zip in the directory on your machine, congratulations! You are a winner!
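If you would rather script the move than drag and drop, base R's file.copy() works too. This is a sketch that assumes your browser saved the zip in a Downloads folder under your home directory; adjust the path if yours differs:

```r
# Assumed source location - adjust if your downloads land elsewhere
zip_name <- 'calenviroscreen40shpf2021shp.zip'
src <- file.path(Sys.getenv('HOME'), 'Downloads', zip_name)

# Copy into the R working directory, then confirm the file arrived
file.copy(from = src, to = file.path(getwd(), zip_name))
file.exists(zip_name)
```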

14.2.3 Unzip the data - Two Ways

Although the data is in the right place, it is not directly readable while zipped.

14.2.3.1 Point and Click Unzip

I think the process is basically the same for Mac and PC, but we will identify this in class.

  • On a Mac, Double-click the .zip file. The unzipped item appears in the same folder as the .zip file.

  • On a PC, right-clicking on a zipped file will bring up a menu that includes an Extract All option. Choosing the Extract All option brings up a pathname to extract the file to. The default is to extract the zip file to a subfolder named after the zip file.

Again, go to the Files, Plots, and Packages panel and check whether there is a folder called calenviroscreen40shpf2021shp, as shown in Figure 14.5.

Figure 14.5: Shapefile folder is in the working directory!

14.2.3.2 Unzip with Code

Same idea. Use the unzip() function to unzip the zipped shapefile folder. We will save it in a separate directory to test whether this way works independently of the point-and-click method. The unzip() function needs two arguments: the path of the zip file (zipfile =) and the extraction directory name (exdir =).

directory <- 'CalEJ4'
unzip(zipfile = 'calenviroscreen40shpf2021shp.zip', exdir = directory)
Warning in unzip(zipfile = "calenviroscreen40shpf2021shp.zip", exdir =
directory): error 1 in extracting from zip file

Check the Files panel for a new CalEJ4 folder; on my machine it appears in the working directory.
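The same check can be done from code; list.files() shows what the unzip produced:

```r
# Expect the shapefile's component files (.shp, .dbf, .prj, ...) inside
# the new extraction directory if the unzip succeeded
list.files(directory)
```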

Another folder is in the working directory!

14.2.3.3 Import the Shapefile

The sf library is used to import geospatial data. The read_sf() function is great at reading and identifying the type of spatial file.

Shapefiles are the Esri proprietary geospatial format and are very common.

The CalEnviroScreen data are in the shapefile format, which is a bunch of individual files organized in a folder directory. In the calenviroscreen40shpf2021shp directory, there are 8 individual files with 8 different file extensions. We can ignore that and just point read_sf() at the directory, and it will do the rest. The dsn = argument stands for data source name, which can be a directory, file, or database.

CalEJ <- read_sf(dsn = directory)

Check the Environment panel after running this line of code. Is there a CalEJ file with 8035 observations of 67 variables present?
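A code-based version of that check, for anyone not watching the Environment panel:

```r
# Expect 8035 census-tract rows if the import matched the build shown here
nrow(CalEJ)
```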

If so, success is yours! Let’s make a map of Pesticide census tract percentiles to celebrate!

14.2.4 Visualize the data

library(leaflet)    # interactive maps: colorNumeric(), addPolygons(), addLegend()
library(htmltools)  # htmlEscape() for safe hover labels

CalEJ <- CalEJ %>% 
  filter(PesticideP >= 0) %>%
  st_transform("+proj=longlat +ellps=WGS84 +datum=WGS84")

palPest <- colorNumeric(palette = 'Reds', domain = CalEJ$PesticideP)

leaflet(data = CalEJ) %>% 
  addTiles() %>% 
  addPolygons(color = ~palPest(PesticideP),
              fillOpacity = 0.5,
              weight = 2,
              label = ~htmlEscape(ApproxLoc)) %>% 
  addLegend(pal = palPest,
            title = 'Pesticide (%)', 
            values = ~PesticideP)

14.2.5 Option 2 - Directly Read the Dataset

Some datasets can be read directly from a URL into R. The NOAA Global Monitoring Laboratory publishes methane monthly average concentrations sampled by flasks.

Today I am selecting Mauna Loa (MLO) in Hawaii. Methane’s chemical formula is CH4, so I will assign the URL to an object named URL.MLO.CH4.

URL.MLO.CH4 <- file.path( 'https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_mlo_surface-flask_1_ccgg_month.txt')

We did this before for Alert, so let’s try the successful code using the read_table() function. Note that when I follow the link, the first line of the dataset says there are 71 header lines. Skipping all 71 means read_table() treats the first data row as the column names - we will fix the names in a moment.

MLO.CH4 <- read_table(URL.MLO.CH4, skip = 71)

── Column specification ────────────────────────────────────────────────────────
cols(
  MLO = col_character(),
  `1983` = col_double(),
  `5` = col_double(),
  `1639.47` = col_double()
)
head(MLO.CH4)
# A tibble: 6 × 4
  MLO   `1983`   `5` `1639.47`
  <chr>  <dbl> <dbl>     <dbl>
1 MLO     1983     6     1633.
2 MLO     1983     7     1633.
3 MLO     1983     8     1631.
4 MLO     1983     9     1648.
5 MLO     1983    10     1664.
6 MLO     1983    11     1658.
headers <- c('site', 'year', 'month', 'value')
colnames(MLO.CH4) <- headers

head(MLO.CH4)
# A tibble: 6 × 4
  site   year month value
  <chr> <dbl> <dbl> <dbl>
1 MLO    1983     6 1633.
2 MLO    1983     7 1633.
3 MLO    1983     8 1631.
4 MLO    1983     9 1648.
5 MLO    1983    10 1664.
6 MLO    1983    11 1658.

This is better.

We can now visualize the data in Figure 14.6.

MLO.CH4 %>% 
  mutate(decimal.Date = (year + month/12)) %>% 
  ggplot(aes(x = decimal.Date, y = value)) +
  geom_point() +
  geom_line(alpha = 0.6) +
  geom_smooth() +
  theme_bw() +
  labs(x = 'Year', y = 'Methane concentration (ppb)',
       title = 'Mauna Loa - methane trend')
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Figure 14.6: Trend in Methane concentrations (ppb) at Mauna Loa, Hawaii
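The decimal.Date conversion in the plotting code is simple arithmetic on the year and month columns; for example, June 1983 maps to:

```r
# Decimal date used for the x-axis: the year plus the month as a fraction
year <- 1983
month <- 6
year + month/12
#> [1] 1983.5
```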

14.2.6 Advanced data visualization

Now that we have Mauna Loa, I want to add the Alert dataset to it using the code we developed last week. This code downloads the Alert dataset and renames its headers.

URL.ALT.CH4 <- file.path( 'https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_alt_surface-flask_1_ccgg_month.txt')
ALT.CH4 <- read_table(URL.ALT.CH4, skip = 70)

── Column specification ────────────────────────────────────────────────────────
cols(
  ALT = col_character(),
  `1985` = col_double(),
  `6` = col_double(),
  `1728.44` = col_double()
)
colnames(ALT.CH4) <- headers

Now we can put the datasets together to make a combined visualization. The bind_rows() function from the tidyverse lets us stack the datasets since they have the same headers. Then we can use the color argument to aes() to get two separate time series as shown in Figure 14.7. I also mapped the symbol shape to site to ensure that the two datasets are distinguishable.

CH4 <- bind_rows(ALT.CH4, MLO.CH4)

CH4 %>% 
  mutate(decimal.Date = (year + month/12)) %>% 
  ggplot(aes(x = decimal.Date, y = value, color = site, shape = site)) +
  geom_point() +
  geom_line(alpha = 0.6) +
  #geom_smooth(se = FALSE) +
  theme_bw() +
  labs(x = 'Year', y = 'Methane concentration (ppb)',
       title = 'Methane trend')

Figure 14.7: Trend in Methane concentrations (ppb) at Mauna Loa, Hawaii and Alert, Canada

14.2.7 Downloading secured zip files

I have not yet found a reliable method to get this to work every time on Macs and PCs. Stay tuned.
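For reference, the usual scripted attempt looks like the sketch below; the URL is hypothetical, and sites that require logins or cookies are exactly where this tends to fail:

```r
# A common pattern for scripted zip downloads; mode = 'wb' prevents
# corrupted binaries on Windows. Not guaranteed to work on secured sites.
url <- 'https://example.com/secured-data.zip'   # hypothetical URL
download.file(url, destfile = 'secured-data.zip', mode = 'wb')
unzip('secured-data.zip', exdir = 'secured-data')
```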

14.3 Exercise 1.

  1. Go to the Environmental Justice Index Accessibility Tool.
  2. Pick a state from the dropdown menu.
  3. Press the Apply button.
  4. An Actions button should appear (see the Action button screenshot). Press the Actions button, select Export All, and choose Export to geoJSON.
  5. A file named DataRecords.geojson should appear in your default download folder.
  6. Move the DataRecords.geojson file to the working directory.
  7. Check the Files panel. Is DataRecords.geojson there?
  8. Read in the file using read_sf(). The dsn argument can point directly to the file name for this type of file. Assign it a name that incorporates EJI and the state abbreviation.
  9. Check the Environment panel. Did it import?
  10. Make a visualization - but not a map, because the projection is wonky.

CA_EJI_raw <- read_sf(dsn = 'DataRecords.geojson') %>% 
  filter(rpl_eji >= 0)

# The projection isn't properly identified in the file; it appears to be
# a US Albers Equal Area Conic (the proj string below)
#st_is_valid(CA_EJI_raw)
#st_crs(CA_EJI_raw)

# Drop the geometry for a non-spatial scatterplot, and set the assumed CRS
# on the spatial copy
CA_EJI <- st_set_geometry(CA_EJI_raw, value = NULL)
CA_EJI_raw <- st_set_crs(CA_EJI_raw, 
                         '+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +type=crs')
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
CA_EJI %>% 
  ggplot(aes(x = epl_rail, y = epl_dslpm)) + 
  geom_point() +
  theme_bw()

EJI index for California

The map below shows Diesel PM from their environmental indicators layer for San Bernardino County.

ggplot(data = CA_EJI_raw) +
  geom_sf(aes(fill = epl_dslpm)) +
  theme_bw()