Scraping ESG Scores

Getting Company and Mutual Fund ESG Scores from Yahoo Finance

This is a coding (R language) walk-through on scraping sustainability/ESG data (scores, ratings, and rankings) from Yahoo Finance for individual companies/stocks and mutual funds. These coding posts are more informal than my research articles and are about process, written for an audience of fellow sustainability/ESG coders and data analysts.

Legal/Ethics

Always scrape legally and ethically! Data scraping is ubiquitous. Websites know this and are generally permissive, albeit within reasonable bounds. From a legal standpoint, read the site’s “Terms of Service” (TOS) before writing a single line of code. From an ethics standpoint, it’s mainly about not being a burden and other common-sense courtesies.

  • Be transparent. Send a custom HTTP header and user agent string containing your name, intent, and contact info.
  • Don’t tax servers or smell like a DDoS attack. Send requests at a modest rate (smell like a human) and use timeouts; see the sketch after this list.
  • If there’s a public API, use it, even if that means a learning curve. The site invested in that API precisely so people don’t have to scrape.
  • Honor robots.txt for the specific path you’re scraping. A site might permit crawling in some areas but not others.
  • Read the TOS; robots.txt is aimed at crawlers, and there’s a fine line between crawling and data mining, which is often prohibited.
  • Be aware of downstream legalities. A site that permits scraping could be hosting third-party copyrighted material.
  • Download only what you need and create something new from it. Don’t hoard data or re-package it untransformed.
  • Show gratitude. If you’re scraping for a blog, give proper and prominent attribution and drive traffic with a backlink.
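To make those courtesies concrete, here is a minimal “polite request” sketch using httr and robotstxt (both of which appear later in this walk-through); the URL and contact details are placeholders, not a real target:

library(httr)
library(robotstxt)

var_url <- "https://example.com/some/page"                 # placeholder target
if (paths_allowed(paths = var_url, bot = "*")) {           # honor robots.txt
  Sys.sleep(2)                                             # modest request rate
  tmp_resp <- GET(
    var_url,
    user_agent("Jane Doe (jane@example.com). Personal research."), # be transparent
    timeout(10)                                            # don't hang on a slow response
  )
}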

I. Context

ESG Investing in Theory

ESG investing1 is a specific strategy under the sustainable/socially-responsible investing (SRI) umbrella. ESG investing incorporates Environmental, Social, and Governance (ESG) factors into investment decisions. Typical examples are:

  • E: Energy consumption and efficiency, emissions and pollutants, water usage, waste generation and disposal
  • S: Employee diversity, LGBTQ+ rights, labor practices, community involvement, charitable giving, human rights
  • G: Board and executive diversity, executive compensation, shareholder rights, disclosure and transparency

Early on, ESG investing was a niche equated with “principles over profits.” Recently, it’s gone more mainstream. Non-financial ESG data complements traditional financial metrics; together they can lead to higher risk-adjusted returns.

ESG Investing in Practice

Unfortunately, like a lot in the investing world, the cards are stacked in favor of institutional investors when it comes to SRI/ESG investing. ESG options are limited for the individual investor, either cost-prohibitive or time-prohibitive.

  • Cost-Prohibitive: Ironically, the mainstreaming of ESG investing has led to extreme commercialization (and borderline monopolization) of ESG information. Third-party ESG data and products are insanely expensive.
  • Time-Prohibitive: The DIY approach just isn’t practical for the casual investor. Freely-available raw ESG data — which commercial products re-package — isn’t standardized, requiring a steep learning curve to interpret.

Morningstar/Sustainalytics

Options are limited, but they do exist. Particularly accessible is Morningstar/Sustainalytics ESG data on Yahoo Finance. It’s freely available for both individual companies/stocks and mutual funds under the “Sustainability” tab on the quote page. Here is a company example, and here’s a fund example. The data is similar for each, with a few differences.

Note: I’m treating Morningstar/Sustainalytics interchangeably throughout this post. It’s more nuanced than that, but basically a few years ago Morningstar and Sustainalytics partnered to provide ESG data for mutual funds and ETFs on Yahoo, while Sustainalytics was independently supplying the company/stock ESG data. Fund ESG scores are based on a portfolio “roll-up” of Sustainalytics’ company scores, and in mid-2017 Morningstar acquired a 40% stake in Sustainalytics. So while there are differences, they’re essentially the same re: ESG data.

This ESG data is basic (i.e., summary-level scores, ratings, rankings) but it’s a godsend for individual investors. First, it’s free and doesn’t require registration. Second, the underlying data is from Sustainalytics and Morningstar, two reputable sources of ESG and fund research respectively. Usually I wouldn’t recommend Yahoo for investing, but this is different.

II. Process

Background

I assume some knowledge of R programming, HTML/CSS, and HTTP requests. This post won’t get too technical but it helps to know how I got the element, attribute, and value (table, id, constituents) when you see something like:

html_nodes("table[id='constituents']")

I’ll go over that more later. Another thing is that while I adhere to R language best practices (naming, styling, etc.) — i.e., code should be legible and self-explanatory — I do have two naming idiosyncrasies to be aware of:

  1. I preface object names with a three-letter descriptor such as lkp_ (for lookup tables), dat_ (data tables), fun_ (functions), var_ (variables), and tmp_ (temporary objects). It really helps keep track of things, particularly for large projects.
  2. Within the occasional unavoidable loop I suffix names using the iterator; e.g., “data.i” where “i” is the iterator. This lets me use the same root name in nested loops (data.i, data.x) and avoids accidental naming conflicts. (A toy illustration follows this list.)
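Purely for illustration, here’s what that convention looks like in practice (the names and values below are made up):

lkp_pillars <- c(E = "Environmental", S = "Social", G = "Governance") # lookup table
dat_scores  <- data.frame(ticker = character(), score = numeric())    # data table
fun_double  <- function(x) x * 2                                      # function
var_today   <- Sys.Date()                                             # variable
tmp_check   <- fun_double(21)                                         # temporary object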

Be aware of how you name things. Careless naming can “delete” existing objects and break things. I cringe when coding tutorials use “data” for a data frame; it masks the base function data(). Lazy and irresponsible from a teaching standpoint.

Step 1: Preparation

The first few steps (all sub-steps under this section) are preparatory and generic; that is, they are the same for scraping companies as for scraping mutual funds.

Step 1a: Load Libraries

Here, we simply load the libraries (aka packages) that we’ll be using throughout:

  • urltools (for a custom function)
  • httr (to fetch website content)
  • robotstxt (to scrape ethically)
  • rvest (to parse HTML and XML)
  • stringr (for string manipulation)
  • dplyr (for data manipulation)
  • tibble (to store data efficiently)

Note: The rvest, stringr, dplyr, and tibble libraries are part of the awesome tidyverse collection and could all be loaded via a single library (tidyverse) but for demonstration purposes I like to make the underlying libraries very clear.

library(urltools)
library(httr)
library(robotstxt)
library(rvest)
library(stringr)
library(dplyr)
library(tibble)

If you don’t have a particular library installed it’s just a matter of install.packages("name") where "name" is the name of the library/package, in quotes. Note that quotes are required for install.packages() but not library().
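If you’d rather install any missing packages in one go, something like this works (a small convenience sketch that checks your installed packages against the list above):

tmp_pkgs <- c("urltools", "httr", "robotstxt", "rvest", "stringr", "dplyr", "tibble")
tmp_new  <- tmp_pkgs[!tmp_pkgs %in% rownames(installed.packages())]
if (length(tmp_new) > 0) install.packages(tmp_new)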

Step 1b: Write Functions

Following the DRY (“don’t repeat yourself”) principle of programming, I’ll first create a few functions. Doing so makes subsequent code a lot cleaner and reduces (virtually eliminates) the chance of error.

The first function (fun_parse) is the workhorse. For each iteration of a loop we’ll receive an XML document (a web page) as a response. This function extracts ESG data from that page. What data it extracts is determined by the XPath plugged into the function as an argument. The function’s workflow is as follows:

  • extracts the content (initially a text string) of the node identified by the “xpath” argument
  • if the string is empty and the XPath is for Controversy Rating return “None” (XPath quirks)
  • if the string contains “% AUM” the surrounding text is stripped and the number is returned as a decimal fraction
  • if the string contains all letters, it’s a rating (e.g., “Outperformer”); I only trim white space
  • if the string contains any numbers at all it’s either a score, a percentile ranking, or a date
  • if the string contains “percentile” it’s a ranking; numerals are extracted and divided by 100
  • if the string contains “updated on” it’s a date; the date string is coerced into a date object (via anytime::anydate(), so the anytime package needs to be installed even though it isn’t attached)
  • anything left at this point is a score (1-100 scale); the text integer is converted to numeric
fun_parse <- function(xpath, xmldoc = page.i) {
  x <- xmldoc %>% 
    html_nodes(xpath = xpath) %>%
    html_text(trim = TRUE)
  if (length(x) == 0 & xpath == '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div/div[2]/div[1]/span/span/span') {
    return("None")
  }
  if (grepl("% AUM", x)) {
    return(as.numeric(sub("% AUM", "", sub("based on ", "", x))) / 100)
  }
  if (!grepl("\\d", x)) {
    return(trimws(x))
  } else {
    if (grepl("percentile", x)) {
      return(x %>% str_replace_all("[^0-9\\.]", "") %>% as.numeric() / 100)
    } else {
      if (grepl("updated on", x)) {
        r <- sub("Last updated on ", "", x)
        r <- paste(unlist(strsplit(r, "/"))[2], unlist(strsplit(r, "/"))[1], sep = "-")
        return(anytime::anydate(r))
      } else {
        return(as.numeric(x))
      }
    }
  }
}

The above extracts most of what we want. The function below finishes the job. It parses the “Product Involvement Areas” table — which indicates involvement in controversial activities — and returns a list of those marked “Yes.” This data is only available for companies, not mutual funds.

fun_lists <- function(xmldoc = page.i) {
  x <- xmldoc %>%
    html_nodes(xpath = '//*[@id="Col2-3-InvolvementAreas-Proxy"]/section/table') %>%
    html_table() %>%
    data.frame()
  n <- sum(grepl("Yes", x[, 2]))
  if (n == 0) return(NA)
  if (n == 1) return(x[grep("Yes", x[, 2]), 1])
  if (n >= 2) return(list(x[grep("Yes", x[, 2]), 1]))
}

The last one is just a helper function, a convenience wrapper around the paths_allowed() function in the robotstxt library. It takes any URL as an argument and parses out the info needed for subsequent arguments. With it, we can use a single function for testing robots.txt regardless of the target’s URL. It’s especially time-saving if scraping multiple sites.

fun_robots <- function(url = link.i) {
  base_url <- paste0(url_parse(url)$scheme, "://", domain(url))
  paths_allowed(
    paths = sub(base_url, "", url), 
    domain = domain(url), 
    bot = "*"
  )
}
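For instance, a quick one-off check of a single Yahoo sustainability URL (the ticker here is arbitrary):

fun_robots("https://finance.yahoo.com/quote/MSFT/sustainability")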

Step 1c: Write User Agent

Transparency is the first bullet point in my introductory sermon on scraping ethics. In this step I create a custom user agent string. It can be anything but should at least include a name and contact info. It’s also nice to state something about what you’re doing (i.e., purpose of scraping). Doing this accomplishes two main things:

  1. It doesn’t absolve you of responsibility but it starts you off on better footing. Sort of like a cop uncle giving you a sticker for your car window; you’re still going down if you murder but you might get a break for rolling through a stop sign.
  2. It humanizes your activity and gives the website’s system administrators incentive and means to contact you if there’s an issue. Otherwise, it’s too easy for them to just blacklist your IP address, even if your activity isn’t really blacklist worthy.

Just for kicks, here’s my default user agent (yours could be different)…

httr:::default_ua()
## [1] "libcurl/7.59.0 r-curl/3.3 httr/1.4.0"

And here I establish a custom user agent string variable to be used in a later step…

var_agent <- "Kyle Rudden (me@kylerudden.com). Doing personal research."

Step 2: Create Data Tables

This is where things start to diverge a little depending on whether we’re scraping ESG data for companies or funds.

Before scraping, we’ll create a table — one each for companies and funds — containing pre-populated metadata (e.g., company or fund name) and empty (placeholder) columns for the ESG data to be downloaded.

Step 2a: Create Data Table (Companies)

The S&P 500 is a good universe to work with — broad enough for meaningful follow-up analyses and its constituents are large enough to have likely been scored by Morningstar/Sustainalytics (i.e., there will be ESG data to scrape).

The code below fetches a table of S&P 500 constituents from Wikipedia. I usually avoid Wikipedia but: 1) everyone can access it, 2) it’s another scraping example, and 3) accuracy doesn’t matter for this post (any tickers will do).

# Note: ^GSPC is the symbol/ticker for the S&P 500 Index
dat_stocks <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_nodes("table[id='constituents']") %>%
  html_table() %>%
  data.frame() %>%
  as_tibble()

Notice that I employ forward-piping using the “%>%” operator from the magrittr library (also from the tidyverse, and loaded as an rvest dependency). Piping results in less code and greater readability. You can read about it here.
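If the pipe is new to you, here’s a toy comparison using a column from the raw table; both lines produce the same result:

# nested calls, read inside-out
tmp_sectors <- head(sort(unique(dat_stocks$GICS.Sector)))

# piped, read left-to-right
tmp_sectors <- dat_stocks$GICS.Sector %>% unique() %>% sort() %>% head()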

I now want to do three things with the raw table: 1) clean up the column names, 2) select only the columns we’ll use, and 3) rename two tickers (i.e., substitute the period in BRK.B and BF.B with a dash; Yahoo uses BRK-B and BF-B).

Inspect current column names (the Wikipedia headers include spaces, which R replaces with periods)…

colnames(dat_stocks)
## [1] "Symbol"                "Security"              "SEC.filings"          
## [4] "GICS.Sector"           "GICS.Sub.Industry"     "Headquarters.Location"
## [7] "Date.first.added"      "CIK"                   "Founded"

Rename columns, reduce table to just those we want (also re-ordering them in the same step), and fix two tickers…

# rename columns
colnames(dat_stocks) <- c("ticker", "company", "filings", "sector", "industry", "location", "added", "cik", "founded")

# select columns
dat_stocks <- dat_stocks[, c("ticker", "company", "sector", "industry")]

# rename tickers
dat_stocks$ticker <- gsub("[.]", "-", dat_stocks$ticker)

And a quick inspection (just the first five rows)…

head(dat_stocks, 5)
## # A tibble: 5 x 4
##   ticker company             sector                industry                    
##   <chr>  <chr>               <chr>                 <chr>                       
## 1 MMM    3M Company          Industrials           Industrial Conglomerates    
## 2 ABT    Abbott Laboratories Health Care           Health Care Equipment       
## 3 ABBV   AbbVie Inc.         Health Care           Pharmaceuticals             
## 4 ABMD   ABIOMED Inc         Health Care           Health Care Equipment       
## 5 ACN    Accenture plc       Information Technolo~ IT Consulting & Other Servi~

Below I add placeholder columns for ESG data that we’ll acquire shortly. I should note that creating these empty ESG columns isn’t necessary, nor is defining their column types (e.g., integer) as I do. You could just let the loop create the new columns. Doing it this way just makes it easier to see what’s coming.

dat_stocks$esgRating    <- as.character(NA) # ESG Rating
dat_stocks$esgScore.tot <- as.integer(NA)   # ESG Score (Total/Overall)
dat_stocks$esgScore.env <- as.integer(NA)   # ESG Score (Environmental)
dat_stocks$esgScore.soc <- as.integer(NA)   # ESG Score (Social)
dat_stocks$esgScore.gov <- as.integer(NA)   # ESG Score (Governance)
dat_stocks$esgRank.tot  <- as.numeric(NA)   # Percentile Rank (Total/Overall)
dat_stocks$esgRank.env  <- as.numeric(NA)   # Percentile Rank (Environmental)
dat_stocks$esgRank.soc  <- as.numeric(NA)   # Percentile Rank (Social)
dat_stocks$esgRank.gov  <- as.numeric(NA)   # Percentile Rank (Governance)
dat_stocks$conRating    <- as.character(NA) # Controversy Rating 
dat_stocks$conLevel     <- as.integer(NA)   # Controversy Level
dat_stocks$conAreas     <- as.character(NA) # Controversy Areas (Products)
dat_stocks$asOf         <- Sys.Date()       # Last Updated date

Step 2b: Create Data Table (Mutual Funds)

For mutual funds I’ll use a group of 50 pre-selected funds (download a CSV file here). They’re mostly SRI/ESG funds but I threw in a few that aren’t (either not explicitly ESG or not rated at all) for the full spectrum of scraping results.

dat_funds <- read.csv("dat_funds.csv", stringsAsFactors = FALSE) %>% as_tibble()
head(dat_funds)
## # A tibble: 6 x 3
##   ticker familyName     fundName                                       
##   <chr>  <chr>          <chr>                                          
## 1 ABEMX  Aberdeen       Aberdeen Emerging Markets Fund                 
## 2 ABALX  American Funds American Funds American Balanced Fund          
## 3 AMRMX  American Funds American Funds American Mutual Fund            
## 4 AEPGX  American Funds American Funds EuroPacific Growth Fund         
## 5 AGTHX  American Funds American Funds The Growth Fund of America      
## 6 AWSHX  American Funds American Funds Washington Mutual Investors Fund

Similar to above, I add placeholder columns for ESG data that we’ll acquire shortly. The columns for mutual funds are slightly different than for stocks/companies.

dat_funds$esgRating    <- as.character(NA) # ESG Rating
dat_funds$esgScore.tot <- as.integer(NA)   # ESG Score (Total/Portfolio)
dat_funds$esgScore.env <- as.integer(NA)   # ESG Score (Environmental)
dat_funds$esgScore.soc <- as.integer(NA)   # ESG Score (Social)
dat_funds$esgScore.gov <- as.integer(NA)   # ESG Score (Governance)
dat_funds$esgScore.aum <- as.numeric(NA)   # ESG Score (% AUM basis, as a decimal fraction)
dat_funds$esgScore.raw <- as.integer(NA)   # ESG Score (Raw)
dat_funds$esgScore.ded <- as.integer(NA)   # ESG Score (Controversy Deduction)
dat_funds$susMandate   <- as.character(NA) # Sustainability Mandate
dat_funds$susRank.pct  <- as.numeric(NA)   # Sustainability Rank (Percentile)
dat_funds$susRank.cat  <- as.character(NA) # Sustainability Rank (Category)
dat_funds$asOf         <- Sys.Date()       # Last Updated date

Step 3: Download ESG Data

Just as with creating data tables I’ll break the actual scraping into sub-steps; one each for companies and funds. There are minor differences but the process is basically the same: iterate through the data table, use the ticker to construct an HTTP request URI, retrieve the entire page, then extract specific XML nodes and their data using XPaths.

A note on XPaths… XPath is a syntax for navigating XML hierarchies. Think of an XPath as directions (a path) to a data item. XPaths are convenient but have a downside: if the site structure changes, the XPaths in scraping code must be updated. More change-resistant options exist but the cost is up-front coding time. For a large project it’d be worth it, but for a few ESG items XPaths are just too easy not to use.
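As a quick illustration using the Wikipedia table from Step 2a, the same node can be reached with either a CSS selector or an XPath (shown only for comparison; the ESG scraping below sticks with XPaths):

tmp_page <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
tmp_page %>% html_nodes("table[id='constituents']") %>% html_table()             # CSS selector
tmp_page %>% html_nodes(xpath = "//table[@id='constituents']") %>% html_table()  # XPath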

A few points about the scripts’ flow (same for companies and funds):

  • it first forms a request URL (based on the ticker) and checks it with robotstxt::paths_allowed() (via the fun_robots() wrapper), creating a bots.i variable
  • if bots.i is true (allowed) the script continues, ironically by pausing (for 0.5 to 3.0 seconds, determined randomly)
  • the page contents are fetched and each of the ESG data items are extracted using the two functions created earlier
  • the extra steps for Controversy Level are because if the level is “None,” a different XPath applies than for other levels
  • everything is wrapped in a tryCatch() so that if there’s a problem with one iteration the whole process doesn’t crash

Step 3a: Download ESG Data (Companies)

Scraping all S&P 500 companies should only take a few minutes depending on computer, internet speed, and the hand you’re dealt with the random pauses. For me it was 2.5 minutes. If you want to speed things up, or otherwise don’t want the entire index, you could always subset dat_stocks. A few super quick examples:

By sector…

sort(unique(dat_stocks$sector))
##  [1] "Communication Services" "Consumer Discretionary" "Consumer Staples"      
##  [4] "Energy"                 "Financials"             "Health Care"           
##  [7] "Industrials"            "Information Technology" "Materials"             
## [10] "Real Estate"            "Utilities"
# energy and utilities sectors
dat_stocks <- subset(dat_stocks, sector %in% c("Energy", "Utilities"))

By sector and industry…

sort(unique(dat_stocks$industry[dat_stocks$sector == "Utilities"]))
## [1] "Electric Utilities"                          
## [2] "Gas Utilities"                               
## [3] "Independent Power Producers & Energy Traders"
## [4] "Multi-Utilities"                             
## [5] "Water Utilities"
# utilities but not water utilities
dat_stocks <- subset(dat_stocks, sector == "Utilities" & industry != "Water Utilities")

Anyway, here’s the full S&P 500 download…

i <- 1
for (i in 1:nrow(dat_stocks)) {
  message(paste0(i, " of ", nrow(dat_stocks)))
  tryCatch({
    tick.i <- dat_stocks$ticker[i]
    link.i <- paste0("https://finance.yahoo.com/quote/", tick.i, "/sustainability")
    bots.i <- suppressMessages(fun_robots(link.i))
    if (bots.i) {
      Sys.sleep(runif(1, 0.5, 3.0))
      page.i <- GET(link.i, user_agent(var_agent)) %>% content()
      dat_stocks$esgRating[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[3]/div/span')
      dat_stocks$esgScore.tot[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]')
      dat_stocks$esgScore.env[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[1]')
      dat_stocks$esgScore.soc[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[1]')
      dat_stocks$esgScore.gov[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[1]')
      dat_stocks$esgRank.tot[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[2]/span/span')
      dat_stocks$esgRank.env[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[2]/span/span')
      dat_stocks$esgRank.soc[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[2]/span/span')
      dat_stocks$esgRank.gov[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[2]/span/span')
      dat_stocks$conRating[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div/div[2]/div[1]/span/span/span')
      dat_stocks$conLevel[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div/div[2]/div[1]/div')
      dat_stocks$conAreas[i] <- fun_lists()
      dat_stocks$asOf[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/span[2]/span')
    }
  }, error=function(e){})
}
dat_stocks$asOf[which(is.na(dat_stocks$esgRating))] <- NA

A quick inspection of the results shows that 95.4% of S&P 500 constituents are rated by Morningstar/Sustainalytics…

scales::percent(sum(!is.na(dat_stocks$esgRating)) / nrow(dat_stocks))
## [1] "95.4%"

Now that I think about it, I want to know the percentage of market capitalization (versus company count). First I have to download market caps (using the quantmod library). I suppress warnings due to an irrelevant time zone warning.

dat_stocks$mktCap <- suppressWarnings(
  quantmod::getQuote(dat_stocks$ticker, what = "marketCap")$marketCap
)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## downloading set: 1 , 2 , 3 , ...done

It’s about the same — 95.5% of total S&P 500 market cap is rated by Morningstar/Sustainalytics…

scales::percent(sum(dat_stocks$mktCap[which(!is.na(dat_stocks$esgRating))]) / sum(dat_stocks$mktCap))
## [1] "95.5%"

Step 3b: Download ESG Data (Mutual Funds)

Again, it’s basically the same process — some slight variation in data points and XPaths and one additional check. The new “if” statement checks a particular XPath for the presence of “ESG” as a test; it’ll return FALSE if there’s no ESG data.

Often there is more recent ESG info on Morningstar’s own site. However, Morningstar’s robots.txt asks that we not scrape it. Just be aware of the occasional lag between morningstar.com and yahoo.com; it seems to be more pronounced for funds than for companies.
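You can verify that with the same robotstxt check used throughout this post. The fund-page path below is hypothetical and shown only to illustrate the call; robots.txt files change over time, so check the current one yourself:

# hypothetical Morningstar fund-page path, for illustration only
paths_allowed(
  paths  = "/funds/xnas/ABALX/sustainability",
  domain = "www.morningstar.com",
  bot    = "*"
)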

i <- 1
for (i in 1:nrow(dat_funds)) {
  message(paste0(i, " of ", nrow(dat_funds)))
  tryCatch({
    tick.i <- dat_funds$ticker[i]
    link.i <- paste0("https://finance.yahoo.com/quote/", tick.i, "/sustainability")
    bots.i <- suppressMessages(fun_robots(link.i))
    if (bots.i) {
      Sys.sleep(runif(1, 0.5, 3.0))
      page.i <- GET(link.i, user_agent(var_agent)) %>% content()
      if (grepl("ESG", fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/h3/span'))) {
        dat_funds$esgRating[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[3]/div/span')
        dat_funds$esgScore.tot[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]')
        dat_funds$esgScore.env[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[1]')
        dat_funds$esgScore.soc[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[1]')
        dat_funds$esgScore.gov[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[1]')
        dat_funds$esgScore.aum[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div[2]/div[1]/div[2]/span')
        dat_funds$esgScore.raw[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div[2]/div[3]')
        dat_funds$esgScore.ded[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div[3]/div[3]')
        dat_funds$susMandate[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/div/span/span/span')
        dat_funds$susRank.pct[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/p[1]/span/span') / 100
        dat_funds$susRank.cat[i] <- page.i %>%
          html_nodes(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/p[2]/span/span') %>%
          html_text(trim = TRUE)
        dat_funds$asOf[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[4]/span[2]/span')

      }
    }
  }, error=function(e){})
}
dat_funds$asOf[which(is.na(dat_funds$esgRating))] <- NA

There are two fields in need of explanation. The sum of esgScore.raw and esgScore.ded (zero or a negative number for the controversy deduction) should equal esgScore.tot. However, it’s often off by one due to rounding and is occasionally off by a lot more for who knows what reason. You can see that below (mostly -1 to +1, except for two extremes).

dat_funds$esgScore.tot - (dat_funds$esgScore.raw + dat_funds$esgScore.ded)
##  [1]  5 -1  0  0  0  0  0  0  1  0  0  1 -1  0  0  0  0  0  0  0  1 -1  0  0  1
## [26]  0  0  1  0  0  1 -1 -1  0  0  6 -1  0  0  0  0  0  0  0  0  0 NA NA  0  0

Step 4: Analysis

This post is about getting the data. The analysis is up to you. I’ll leave it at a couple of rough summarizations. The first is a simple roll-up of dat_stocks by sector (ESG scores are simple averages versus market-cap weighted)…

dat_look <- subset(dat_stocks, !is.na(esgRating)) %>%
  group_by(sector) %>%
  summarise(
    esgScore.tot = ceiling(mean(esgScore.tot)),
    esgScore.env = ceiling(mean(esgScore.env)),
    esgScore.soc = ceiling(mean(esgScore.soc)),
    esgScore.gov = ceiling(mean(esgScore.gov))
  ) %>%
  ungroup()

dat_look <- dat_look[order(dat_look$esgScore.tot, decreasing = TRUE), ]
dat_look
## # A tibble: 11 x 5
##    sector                 esgScore.tot esgScore.env esgScore.soc esgScore.gov
##    <chr>                         <dbl>        <dbl>        <dbl>        <dbl>
##  1 Utilities                        67           59           71           74
##  2 Consumer Staples                 63           66           58           65
##  3 Information Technology           63           65           61           64
##  4 Materials                        62           60           62           67
##  5 Energy                           61           55           64           65
##  6 Financials                       58           55           61           57
##  7 Health Care                      58           67           51           60
##  8 Industrials                      58           57           55           64
##  9 Real Estate                      58           61           51           62
## 10 Consumer Discretionary           57           55           55           64
## 11 Communication Services           54           57           51           59
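Since market caps were downloaded in Step 3a, a market-cap-weighted variant of the same roll-up is only a small change (a sketch using base R’s weighted.mean()):

dat_look_wt <- subset(dat_stocks, !is.na(esgRating)) %>%
  group_by(sector) %>%
  summarise(
    esgScore.tot = ceiling(weighted.mean(esgScore.tot, mktCap)),
    esgScore.env = ceiling(weighted.mean(esgScore.env, mktCap)),
    esgScore.soc = ceiling(weighted.mean(esgScore.soc, mktCap)),
    esgScore.gov = ceiling(weighted.mean(esgScore.gov, mktCap))
  ) %>%
  ungroup()

dat_look_wt <- dat_look_wt[order(dat_look_wt$esgScore.tot, decreasing = TRUE), ]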

Or a similar summary of dat_funds — rolled up to the fund family level and simple score averages…

dat_look <- subset(dat_funds, !is.na(esgRating)) %>%
  group_by(familyName) %>%
  summarise(
    esgScore.tot = ceiling(mean(esgScore.tot)),
    esgScore.env = ceiling(mean(esgScore.env)),
    esgScore.soc = ceiling(mean(esgScore.soc)),
    esgScore.gov = ceiling(mean(esgScore.gov))
  ) %>%
  ungroup()
dat_look <- dat_look[order(dat_look$esgScore.tot, decreasing = TRUE), ]
head(dat_look, 10)
## # A tibble: 10 x 5
##    familyName                esgScore.tot esgScore.env esgScore.soc esgScore.gov
##    <chr>                            <dbl>        <dbl>        <dbl>        <dbl>
##  1 Aberdeen                            53           54           52           53
##  2 Calvert Research and Man~           50           55           55           52
##  3 Morgan Stanley                      50           54           55           49
##  4 Jensen                              49           60           56           53
##  5 JPMorgan                            49           56           55           53
##  6 Parnassus                           48           54           53           52
##  7 TIAA Investments                    48           55           54           53
##  8 American Funds                      47           56           53           52
##  9 MainStay                            47           56           54           49
## 10 Pioneer Investments                 47           58           53           52

III. Conclusion

Invest responsibly. Scrape ethically. Feel free to email me with questions or just to discuss.



  1. For good introductions to SRI/ESG investing see MSCI’s ESG 101: What is ESG Investing? and the Forum for Sustainable and Responsible Investment’s SRI Basics: What is Sustainable, Responsible, and Impact Investing?

Kyle Rudden
About the Author
I am a sustainability analyst, author, and consultant. My focus is SRI/ESG investing and sustainable finance. I combine subject knowledge with a hacker mindset and an eclectic technology stack to uncover original ESG insights along roads less traveled. A core area of expertise is energy sustainability and related environmental and technology issues (e.g., grid cybersecurity). My experience includes running a global Equity Research practice at a Fortune 500 investment bank and founding an ESG investment research firm. I'm intensely inquisitive and obsessed with coding so be forewarned of my occasional 'experimental ESG' posts. me@kylerudden.com