11  EJ - Exploratory Analysis Case Study

Today we will practice visualization-based analysis of EJ data using R.

Today we will be iterating on a visualization-based analysis as shown in the conceptual model in Figure 14.1

Figure 11.1: Tidyverse framework for exploratory analysis

11.1 Load and Import Steps

Load the libraries we’ll be using today.

Import the SoCalEJ dataset again, if you don’t have it already loaded.

URL.path <- 'https://raw.githubusercontent.com/RadicalResearchLLC/EDVcourse/main/CalEJ4/CalEJ.geoJSON'
SoCalEJ <- st_read(URL.path) %>% 
  st_transform("+proj=longlat +ellps=WGS84 +datum=WGS84")

11.2 Basic Visualization

Let’s compare some variables by county to see how the counties are different.

Pick your own variable to plot - do not pick OzoneP which is my variable for now.

Use filter(...) to only keep values above or equal to zero. We don’t want to include census tracts that are missing data that have -999 values.

Figure 11.2 shows the distribution of ozone exposure percentages by county.

SoCalEJ %>% 
  filter(OzoneP >= 0) %>% 
  ggplot(aes(x = County, y = OzoneP)) +
  geom_boxplot()

Figure 11.2: Ozone census tract distribution by county

There are clear differences in ozone by county, with the Inland counties having higher ozone than the coastal counties. The differences are statistically significant.

Unfortunately, a wide dataset is not great at displaying multivariate information in ggplot. A bit of tidying is needed to display multiple variables.

11.3 Tidy and Transform

I am going to demonstrate a few somewhat fancy data manipulation techniques. This is somewhat advanced database programming. While this is very helpful for visualization, it goes beyond the things I expect you to learn for this course.

This code does three things.

  1. Remove the geometry using st_set_geometry(value = NULL)
  2. Transform the data table from wide to long using pivot_longer(...)
  3. Remove values below zero using filter()
# select socioeconomic indicators and make them narrow - only include counties above 70%
SoCal_narrow <- SoCalEJ %>% 
  st_set_geometry(value = NULL) %>% 
  pivot_longer(cols = c(5:66), names_to = 'variable', values_to = 'value') %>% 
  filter(value >=0)

Let’s compare the SoCalEJ and SoCal_narrow datasets using head().

head(SoCalEJ)
Simple feature collection with 6 features and 66 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.874 ymin: 33.5556 xmax: -117.7157 ymax: 33.64253
CRS:           +proj=longlat +ellps=WGS84 +datum=WGS84
       Tract   ZIP County     ApproxLoc TotPop19   CIscore   CIscoreP
1 6059062640 92656 Orange   Aliso Viejo     3741  9.642007 12.1028744
2 6059062641 92637 Orange   Aliso Viejo     5376 10.569290 14.3343419
3 6059062642 92625 Orange Newport Beach     2834  3.038871  0.6807867
4 6059062643 92657 Orange Newport Beach     7231  6.538151  5.8497226
5 6059062644 92660 Orange Newport Beach     8487  8.873604 10.4009077
6 6059062645 92657 Orange Newport Beach     6527  6.033648  4.7402925
       Ozone   OzoneP     PM2_5  PM2_5_P   DieselPM DieselPM_P  Pesticide
1 0.05165298 65.36403  9.445785 43.88301 0.14536744   50.13068 0.00000000
2 0.05219839 66.80772  9.785209 46.89484 0.07372588   27.41755 0.00000000
3 0.04827750 55.38270 10.433417 52.03485 0.03478566   12.40821 0.06496406
4 0.04901945 58.23273 10.013395 49.29683 0.03594981   12.90604 0.00000000
5 0.04863521 56.96329 10.565404 52.63223 0.07701293   28.73678 0.00000000
6 0.04863521 56.96329 10.329985 51.28811 0.04047264   14.58619 0.00000000
  PesticideP   Tox_Rel Tox_Rel_P   Traffic TrafficP DrinkWat DrinkWatP
1    0.00000  293.8420  41.44786  999.0182  57.7125 177.3831  5.907331
2    0.00000  681.0703  57.65191 1051.9778  61.2250 312.7169 25.939803
3   22.58621 1557.6400  74.05601  874.8193  49.5000 332.6654 32.284251
4    0.00000 1349.2300  70.69267 1169.7532  67.4250 332.6654 32.284251
5    0.00000 1889.5133  78.03201 1383.9867  74.9500 332.6654 32.284251
6    0.00000 1589.1400  74.44361 1025.2790  59.5750 332.6654 32.284251
       Lead    Lead_P Cleanup  CleanupP GWThreat GWThreatP HazWaste HazWasteP
1 18.450498 10.737240     0.0  0.000000      0.0  0.000000    0.280  46.80344
2  9.898042  3.730309     0.0  0.000000      0.0  0.000000    0.280  46.80344
3 14.263664  6.830498     4.5 40.836133      2.0 14.311805    0.150  26.67108
4  5.594984  1.600504     0.0  0.000000      0.5  2.722896    0.210  37.68365
5 16.777373  9.061122     0.0  0.000000      7.5 39.448780    0.395  60.22502
6 12.086059  5.343415     0.7  9.593132      2.0 14.311805    0.100  16.63799
  ImpWatBod ImpWatBodP SolWaste SolWasteP PollBurd PolBurdSc PolBurdP Asthma
1         7   66.73667        0   0.00000 30.50123  3.724194 18.95457  21.70
2         7   66.73667        2  52.89805 35.23480  4.302163 29.71998  21.74
3         8   72.15456        0   0.00000 35.68847  4.357555 30.77785  14.93
4         6   58.69383        0   0.00000 30.97653  3.782228 19.87554  10.33
5         2   23.87652        2  52.89805 39.48486  4.821094 41.44368  13.88
6         3   33.15834        2  52.89805 32.98028  4.026885 24.04480  10.51
     AsthmaP LowBirtWt   LowBirWP Cardiovas CardiovasP Educatn     EducatP
1 11.2786640      4.45 37.2337696      9.74 27.5049850     1.0    1.771703
2 11.3908275      3.50 16.2176033      8.72 18.8310070     8.4   35.876993
3  3.7263210      1.18  0.2950988      5.41  1.8569292  -999.0 -999.000000
4  0.9845464      5.39 61.9450860      4.46  0.3988036     2.0    5.859276
5  2.7542373      2.86  7.6597383      5.43  1.9192423     3.7   14.781068
6  1.0343968      4.64 42.1991275      4.65  0.4860419     0.0    0.000000
  Ling_Isol Ling_IsolP Poverty  PovertyP Unempl     UnemplP HousBurd HousBurdP
1       1.1   7.375829    20.0 34.208543    3.0   17.113483     19.9 62.420786
2       4.4  33.942347    12.9 16.821608    2.6   11.868818     19.5 60.925222
3       0.0   0.000000     9.5  8.982412    0.9    1.145237     14.5 35.817490
4       3.9  30.694275     6.9  4.170854    2.6   11.868818      8.5  8.504436
5       5.0  37.664095    12.3 15.326633    4.0   30.882353     18.9 58.225602
6       1.0   6.266071    11.5 13.517588 -999.0 -999.000000     14.8 37.477820
    PopChar PopCharSc   PopCharP Child_10 Pop_10_64 Elderly65 Hispanic   White
1 24.958604 2.5890185 13.2375189  10.9864   81.7696    7.2441  16.4662 63.2184
2 23.683405 2.4567389 11.7750882  13.2812   61.9792   24.7396  22.0238 55.2455
3  6.722867 0.6973798  0.2647504   5.9280   49.6471   44.4248   5.6104 89.6260
4 16.664505 1.7286508  4.9167927   7.9657   67.7361   24.2982   4.4530 63.9331
5 17.743511 1.8405788  5.8119012  10.4866   72.3224   17.1910   8.8488 82.0785
6 14.444279 1.4983412  3.3787191  11.0311   68.4388   20.5301   3.8302 74.7664
  AfricanAm NativeAm OtherMult Shape_Leng Shape_Area    AAPI
1    4.9452   0.0000    3.6087   5604.596    1467574 11.7616
2    1.3765   0.0000    3.5900   6244.598    2143255 17.7641
3    0.3881   0.0000    1.4467   6803.947    2048571  2.9287
4    0.0000   0.3042    6.0987  26448.643   18729420 25.2109
5    0.0000   0.0000    1.1076  10964.108    4361037  7.9651
6    0.2911   0.0000    0.8733   9833.066    5754643 20.2390
                        geometry
1 MULTIPOLYGON (((-117.7178 3...
2 MULTIPOLYGON (((-117.7166 3...
3 MULTIPOLYGON (((-117.8596 3...
4 MULTIPOLYGON (((-117.7986 3...
5 MULTIPOLYGON (((-117.8521 3...
6 MULTIPOLYGON (((-117.8269 3...
head(SoCal_narrow)
# A tibble: 6 × 6
       Tract   ZIP County ApproxLoc   variable     value
       <dbl> <dbl> <chr>  <chr>       <chr>        <dbl>
1 6059062640 92656 Orange Aliso Viejo TotPop19 3741     
2 6059062640 92656 Orange Aliso Viejo CIscore     9.64  
3 6059062640 92656 Orange Aliso Viejo CIscoreP   12.1   
4 6059062640 92656 Orange Aliso Viejo Ozone       0.0517
5 6059062640 92656 Orange Aliso Viejo OzoneP     65.4   
6 6059062640 92656 Orange Aliso Viejo PM2_5       9.45  

The SoCal_narrow dataset has taken the 60+ columns from SoCalEJ and condensed them into a single column indicating the variable and another column indicating the value for that variable. This is very useful for grouping and visualizing by category of information.

Now let’s display a box plot with three pollution variables simultaneously using this narrow dataset as shown in Figure 11.3. We again use filter(), but we combine it with the %in% operator to select multiple variables to display.

SoCal_narrow %>% 
  filter(variable %in% c('OzoneP', 'DieselPM_P', 'PolBurdP')) %>% 
  ggplot(aes(x = County, y = value, fill= variable)) +
  geom_boxplot()

Figure 11.3: Comparison of ozone, diesel PM, and Pollution burden by county

Cool! Now we are seeing some interesting differences.

11.3.1 Exercise 1.

  1. Create a boxplot that displays five simultaneous variables by County, either by adding two new variables and/or replacing the existing variables. I recommend showing the percentage values that end in P.
  2. Choose a different theme()
  3. Show a box plot of the six racial and ethnic variables - Hispanic, White, AfricanAm, NativeAm, OtherMult, and AAPI. It should look something like Figure 11.4

Figure 11.4: Comparison of racial and ethnic population distributions by county

11.3.2 Explore the Dataset

Data visualization isn’t just a final product. To get to the final product usually requires doing significant visual exploration to reveal information and knowledge.

Let’s walk through a few examples of methods to explore the data.

11.3.2.1 Scatter plots

Is there a relationship between a dependent and an independent variable or 3? Scatter plots and fits help to examine that.

Figure 11.5 investigates poverty as an independent variable with the pollution burden indicator by county.

SoCalEJ %>% 
  filter(PovertyP >= 0) %>% 
  ggplot(aes(x = PovertyP, y = PolBurdP, color = County)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() +
  theme_bw() +
  facet_wrap(~County)

Figure 11.5: Relationship of poverty and pollution burden by county

Very interesting dataset here. Poverty percentage in a census tract increases pollution burden in Orange, LA, and Riverside County but has no impact in San Bernardino. Riverside is the least pollution burdened on average, while both LA and Orange County have the highest pollution burden.

Let’s look at one other scatter plot of estimated pollution burden and a health outcome.

SoCalEJ %>% 
  filter(PolBurdP >= 0 & CardiovasP >= 0) %>% 
  ggplot(aes(x = PolBurdP, y = CardiovasP, color = County)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() +
  theme_bw() +
  facet_wrap(~County)

Figure 11.6: Relationship of pollution burden and cardiovascular disease by county

Very strange here. Orange County has a positive relationship, but the other counties have non-linear relationships between these two variables.

11.3.2.2 Exercise 2.

  1. Generate a hypothesis of a causal relationship that you can test. Does variable X cause variable Y to increase/decrease?
  2. Prepare a four-county scatter-plot of your selected variables you think may have a causal relationship.
  3. Examine the results. Is there a relationship? Does it vary by county?

11.3.2.3 Histograms

Histograms are useful ways to explore a distribution of values.

The basic histogram is shown in Figure 11.7. The distribution of high pollution burden census tracts is skewed right towards higher values (i.e., worst scores).

SoCalEJ %>% 
  filter(PolBurdP >= 0)  %>% 
  ggplot(aes(x = PolBurdP)) +
  geom_histogram() #+
  #theme_bw() +
  #facet_wrap(~County)

Figure 11.7: Distribution of pollution burden scores by census tract

Now let’s make that prettier and add a facet_wrap() by county as shown in Figure 11.8. I’ll also fix the axis labels using the labs() function to name them real names.

SoCalEJ %>% 
  filter(PolBurdP >= 0)  %>% 
  ggplot(aes(x = PolBurdP, fill = County)) +
  geom_histogram() +
  theme_bw() +
  facet_wrap(~County) +
  labs(x = 'Pollution Burden (%)', 
       y = 'Count of census tracts')

Figure 11.8: Distribution of pollution burden scores by census tract and county

Now, we can clearly see very big differences in census tract counts and distributions of the pollution burden variable. LA County has a massive distribution of highly burdened census tracts.

11.3.2.4 Exercise 3.

  1. Choose a variable you think is interesting and make a four-county histogram plot of it.
  2. Is there a story that you can start to craft with your histogram?

11.3.3 Bar and Column plots (column is usually better)

Basic example of a bar plot is shown in Figure 11.9.

SoCalEJ %>% 
  filter(PolBurdP >= 0)  %>% 
  ggplot(aes(x = County)) +
  geom_bar() 

Figure 11.9: Basic bar plot

There are more census tracts in LA than the other counties, because far more people live in LA than the other counties.

We can try to put the places in there, but it gets messy as shown in fig-Bar2 when looking at San Bernardino places. I’ve switched them to the y-axis to make it horizontal and made the font text smaller using

SoCalEJ %>% 
  filter(County == 'San Bernardino') %>% 
  ggplot(aes(y = ApproxLoc)) +
  geom_bar() +
  theme(axis.text = element_text(size = 6)) +
  labs(y = '', x = 'Count of census tracts')

Figure 11.10: Basic bar plot

Note that geom_bar() works well with categorical variables, but doesn’t like continuous and numerical values.

11.3.4 Exploration and Improvisation

In this section, we are going to follow your interests to generate visualizations improvisationally. Hopefully there will be minimal struggle.