Utility Cybersecurity Disclosures

Text Analysis of Cybersecurity Disclosures in Utility Form 10-K Filings

By Kyle Rudden on 12/21/18 (updated 02/06/19)

Cyber risks and cybersecurity management are material investing issues, enough so that the SEC has issued disclosure guidelines. Since major successful cyber attacks on electricity infrastructure have the potential to impact all other sectors, industries, and companies, I decided to analyze electric utility cybersecurity disclosures in Form 10-K filings.

I. Summary

Context Summary

This article focuses on my analysis of cybersecurity disclosures by electric utilities in their SEC Form 10-K¹ filings. It’s a follow-up to Cybersecurity & ESG Investing in which I discuss cyber risk as an investing concern and provide much of the contextual background for this post. I’d suggest reading it first but its key points that underpin this piece are:

Cyber risks and cybersecurity management are material investing issues relevant to general and SRI/ESG investing.
Cybersecurity disclosure is squarely on the agendas of regulatory bodies such as the SEC, and with growing urgency.
When assessing cyber risks, investors need to look beyond a single company and consider the full spectrum of risks.
Utilities are ubiquitous in supply chains and thus are a universal source of portfolio cyber risk, even absent holdings.

Results Summary

My goal was to get a general feel — nothing overly scientific — for utility cybersecurity disclosures by answering basic questions about: Prevalence (who is discussing cybersecurity?), Conspicuity (where within the 10-K do disclosures appear?), Substance (what’s being discussed?), and Uniqueness (how similar are disclosures?). A few summary points:

Prevalence: All utilities included some cybersecurity disclosure in all years. Importantly, the number of words related to cybersecurity increased 36.7% from 2013-2017. It’s only 0.6% of all words, but the 10-K isn’t the place for robust detail.
Conspicuity: Within the 10-K, disclosures appear in conspicuous (early and commonly-read) sections, often appear in more than one section, and largely conform to the SEC’s guidance re: location. In short, no one is “hiding” the issue.
Substance: The wording of cybersecurity disclosures is still largely generic but it’s becoming more substantive in ways that matter (e.g., rather than just listing cyber risks, companies are addressing prevention in Corporate Governance).
Uniqueness: There is still a high level of document similarity (in words, topics, and sections) across utilities, which to some extent is inevitable, but a handful of utilities are providing a greater level of company-specific information.

Though not a focus of this article, I also took a quick look at relationships between cybersecurity disclosure and: 1) ESG performance, and 2) functional roles in maintaining grid reliability. These appear at the end of the Results section.

Methods Summary

Methodological particulars appear at the end. The gist is that I employed various programming and natural language processing (NLP) methods to collect, process, classify, and analyze unstructured textual data contained in the 10-K filings of 50 electric utilities over the 2013-2017 period. The most important things to note up-front are these caveats:

The analysis herein is very specific — cybersecurity disclosures in 10-K filings of investor-owned utilities (IOUs) — and thus shouldn’t be construed as anything more, such as a statement about utility cybersecurity in general.
While IOUs account for a large percentage of the nation’s electricity infrastructure there are many large cooperatively-, municipally-, and federally-owned utilities that do not file 10-Ks but play important roles in grid security.
Form 10-Ks are just one source of cybersecurity-related data for investors, and a relatively minor one at that since SEC filings by nature (factual and historical-looking) aren’t the place for robust, forward-looking discussion.

II. Context

Cyber Risks and Cybersecurity are Material Investing Issues

Cybersecurity is a material investment consideration, both generally and for SRI/ESG investing. Cyber attacks can have major business, financial, and capital markets consequences and impact many Environmental, Social, and Governance issues. In fact, cyber risks are so relevant to investing that SEC² and FASB³ issued guidance on cybersecurity disclosures and the SASB mentions cybersecurity in the majority of its 77 industry-specific accounting/reporting standards.⁴

Full-Scope Cyber Risk Awareness is an Absolute Imperative

SRI/ESG investors are broadminded. Save for specialist strategies (e.g., clean energy fund) they generally view each leg of ESG as equally important. Moreover they are particularly concerned with, and astute at understanding, a company’s ESG profile in a truly holistic way, also taking into account the ESG impacts of its suppliers and other stakeholders.

This kind of awareness is critical vis-à-vis cyber risks which — even as far as several links away in a supply chain — have the potential to damage a company’s full-scope ESG footprint, cripple its business, and crush its stock price. If that sounds overly alarmist, just ponder the potentiality of a major attack on a critical infrastructure sector like utilities.

Utilities are a Universal Source of Supply Chain Cyber Risk

Even SRI/ESG investors with no direct utilities investments are exposed to the sector’s cyber risks. Major, successful attacks on electricity infrastructure will affect all other sectors, industries, and companies. Except for a company that doesn’t have electricity in its supply chain — if you can find one — utility cybersecurity is a universal ESG issue.

Electricity is Prerequisite to the Sustainability of Everything

Energy, most notably electricity, is a universal input and prerequisite to the sustainability of everything else; it’s essential to all activity from economic production to government functioning. If you do find a company with minimal direct reliance on electricity, it likely relies heavily on another source (e.g., natural gas) which in turn relies on electricity for extraction, transportation, and distribution.

Potential Disruptions are Huge with Far-Reaching Impacts

The utilities sector’s diverse operations span three of the sixteen critical infrastructure sectors (CIS) identified by the U.S. Department of Homeland Security (DHS) in its National Infrastructure Protection Plan:⁵

Energy: Utilities and non-nuclear power generators are big components of the DHS-defined⁶ Energy sector
Nuclear: Reactors, Materials, and Waste includes twenty-two IOUs that own 77% of U.S. nuclear generation
Dams: The Dams sector includes hydroelectric generation and utilities own over half of the nation’s total

Moreover, utility operations affect every other CIS. Major cyber attacks could spark a chain of cascading power outages leading to interruptions in other forms of energy including gasoline; potable water; food production, distribution, and refrigeration; critical communications; health and emergency services; and government including national defense.

Note: While well-placed and timed cyber attacks could start a chain of outages, severe outages are rare in the U.S. and cascading outages are usually limited in geographical scope. Occasionally however, a small initial failure can propagate across power grids — sometimes even jumping non-contiguously — to cause widespread failures. It’s a low probability but high impact event.

Utilities are in the Crosshairs and Increasingly Vulnerable

Given the potential for outsized disruption, utilities are a prime target for a host of actors and agendas. Some groups⁷ below are greater threats than others. Notably, those most interested in wreaking the kind of havoc caused by an electricity system attack are also those most capable of pulling it off (nation states, well-organized terrorist groups).

Cyber Warfare: State-sponsored cyber acts of war on other nation-states for political, strategic, or military purposes
Cyber Terrorism: Politically- and/or ideologically-motivated attacks such as by terror networks or sub-national groups
Cyber Crime: A wide range of profit-motivated attacks, from credit card fraud to corporate ransomware and espionage
Cyber Hacktivism: A blend of hacking and activism, these actors (e.g., Anonymous) hack for political or social causes
Cyber Vandalism: Intentional or accidental, and à la the Mountain Dew-fueled teenager in a dark basement stereotype

Eager threat agents plus an expanding array of attack vectors equals vulnerability. The latter is due to growing reliance on technology and communications. Modernization of traditional utility infrastructure (“plants and wires”) puts more “cyber” in cyber-physical systems⁸ and investments in things like distributed generation, energy efficiency, and smart metering lead to a more IoT-centric industry. All of this creates more access points for the ill-inclined.

A Few Cybersecurity Leaders Aside, Utilities are Ill-Prepared

The DHS and Department of Energy (DOE) conducted a joint assessment⁹ of the utility sector’s cyber incident response capabilities and identified major shortcomings including situational awareness and impact assessment, intelligence gathering and sharing, understanding of roles and responsibilities, and shortages in cybersecurity expertise.

The DHS/DOE assessment is alarmingly accurate. To be fair, however, two things must be noted: 1) certain utilities truly “get” cybersecurity and deserve credit where it’s due, and 2) some blame for the sector’s ill-preparedness falls outside the utilities themselves. Regulation is especially relevant to the cybersecurity preparedness problem.

Utilities are regulated at different levels. For those subject to federal oversight, cybersecurity is overseen by the FERC which adopted the NERC CIP standards and has its act together. However, many utilities are state regulated. At this level cybersecurity standards are lacking or fragmented, and there’s debate re: how cybersecurity should be funded.

Cyber Incidents Targeting Electric Utilities are on the Rise

Cyber intrusions (sniffing, probing, phishing, and other precursors to more dramatic attacks) are on the rise. One recent event had real-world, physical consequences when five natural gas pipelines fell victim to a cyber attack targeting their third-party electronic communications systems, resulting in service disruptions that had collateral effects on electric utilities. The following are just a few examples of notable activity over the last year:

As recently as January 29, 2019, the Director of National Intelligence, the head of the U.S. Intelligence Community, issued a threat assessment report¹⁰ stating that China and Russia have the ability to disrupt the U.S. electricity grid.
Cybersecurity researchers reported that an advanced persistent Iranian hacking group had been targeting the industrial control systems (ICS) of electric utility companies in the U.S., Europe, East Asia, and the Middle East.
The DHS revealed that a campaign by Russian hackers in 2017 had compromised the networks of multiple U.S. electric utilities putting the attackers in a position where they could have caused extensive electricity outages.
Researchers announced that a hacking group tied to Russian intelligence services was conducting cyber reconnaissance on the business operations, ICS systems, and computer networks of several electric utilities in the U.S. and U.K.
Energy Services Group, which provides communications and data services to utilities, was the target of a cyber attack that caused five major natural gas pipelines to shut down, in turn impacting electric utilities including Duke Energy.
The FBI and DHS issued a joint technical alert warning of generalized Russian cyber attacks against critical infrastructure in the U.S. Topping the list of targets were the energy and utilities sectors, with specific mention of nuclear plants.
The FBI and DHS issued a joint technical alert specifically warning of a “multi-stage intrusion campaign” by Russia-linked hackers targeting ICS/SCADA systems at U.S. energy companies and other critical infrastructure.
The FBI and DHS announced that hackers have been targeting U.S. energy facilities including the Wolf Creek Nuclear Generating Station in a campaign bearing resemblance to the operations of a known Russian hacking group.

III. Results

My modus operandi is to present results before boring you with methodological particulars (discussed later) but for quick context the results discussed in this section are based on text analysis of Form 10-K filings by 50 electric utilities over five years. My aim was to answer basic questions about cybersecurity disclosures, categorized as follows:

Prevalence: Who is discussing cybersecurity? Which utilities do or don’t address cyber risks in 10-K filings?
Location: Where within the Form 10-K structure (Parts and Items) does cybersecurity-related text appear?
Wording: What topics are addressed? What are frequent words and themes? Are they specific or generic?
Uniqueness: How similar/different is disclosure text across companies, and for a given company over time?
Correlations: Does the quality of cybersecurity disclosures correlate with ESG scores, especially governance?

Note: Regarding the first category, prevalence, there isn’t enough substance to warrant a separate section. The short answer is 100% of utilities include cybersecurity disclosure in all years. Related issues are addressed in other sections.

Location of Cybersecurity Disclosures

My primary goal — in fact my only goal until I got carried away — was to assess the location of cybersecurity-related text within the Form 10-K structure. More specifically, the distribution and relative density of text among and within the 10-K’s four Parts and twenty-two Items. My interest in document location is really about three things:

Conspicuity: Form 10-Ks aren’t pretty, with plenty of places to bury something without getting caught by anyone but first-year bond analysts. The more that disclosures appear in early, commonly-read sections (e.g., Risk Factors, MD&A) the better.
Diversity: Conspicuous early mentions of cyber risks are important but not in lieu of detailed discussion in later sections, for example addressing preventative cybersecurity measures and expertise in Part III, Item 10 (Corporate Governance).
Conformity: Though not critical, some degree of adherence to the SEC’s cybersecurity disclosure guidance can’t hurt. Its advice is logical, will likely serve as a guide for investors, and could someday become more rule than suggestion.

Location of Cybersecurity Disclosures by Form 10-K Section

Chart 1 below shows the location of cybersecurity text by year and section. Color indicates relative intra-year density of cybersecurity-related text — i.e., cybersecurity-related word count for the section as a percentage of total cybersecurity word count. Dark green indicates highest density, dark red indicates least, and gray means no cybersecurity wording.
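To make the density measure concrete, here is a minimal Python sketch of the calculation described above: each section's cybersecurity word count divided by the total cybersecurity word count for that year. The function name and the toy counts are assumptions for illustration only; they are not the code or data behind Chart 1.

```python
def section_density(cyber_words_by_section):
    """Return each section's share (in percent) of total cybersecurity words."""
    total = sum(cyber_words_by_section.values())
    if total == 0:
        # No cybersecurity wording anywhere: every section gets 0%.
        return {section: 0.0 for section in cyber_words_by_section}
    return {
        section: 100.0 * count / total
        for section, count in cyber_words_by_section.items()
    }


# Toy word counts for one hypothetical filing year (not real data).
toy_counts = {"Item 1": 20, "Item 1A": 70, "Item 7": 10}
density = section_density(toy_counts)
print(density)
```

Because the shares are relative to the year's own total, they are comparable within a year but say nothing about how much cybersecurity text there is overall.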

Some cybersecurity text appears in an introductory section usually labeled “Forward-Looking Statements” pertaining to a “safe harbor” provided by the Securities Act of 1933 and its amendments. It protects a company’s use of prospective language as long as a disclaimer is provided. I excluded this text from my analysis since it’s always repeated verbatim elsewhere (same with Tables of Contents).

Most of the values underlying the chart are so small (cybersecurity wording in most sections is a tiny fraction of overall wording) that they matter for this chart only in relative terms. For perspective, the median percentage is 0.7%. However, there are a few noteworthy sections and observations:

In 2017, text related to cybersecurity disclosures appeared in fourteen of twenty-two sections (excluding introductions).
A full 92.6% of cybersecurity wording appears in just three sections: Part I, Item 1A (Significant Risk Factors), Part I, Item 1 (Business Description), and Part II, Item 7 (Management D&A), listed in descending order of percentage.
Part I, Item 1A (Significant Risk Factors) is the single largest section accounting for 72.2% of all cybersecurity text. This is a good thing; it’s a commonly-read section that appears early in the 10-K and is consistent with SEC guidance.
Part I, Item 1 (Business Description) follows at 13.7%. Some companies detail cyber risk factors in this section, thus for this analysis Item 1 is tantamount to Item 1A. Together Item 1 and Item 1A account for 85.9% of all cybersecurity text.
The third most cybersecurity-dense section is Part II, Item 7 (Management D&A), but it contributed only 6.8%. It’s closely related to footnotes in Part II, Item 8 (Financial Statements) so you could consider them together (10.3%).
Part III, Item 10 (Corporate Governance) ranks 9th (a mere 0.4%). This is unfortunate because this is where investors — particularly SRI/ESG investors — are likely to look first for substantive discussion of cybersecurity preparedness.

Chart 1: Location of Cybersecurity Disclosures by Section

Relative Number of Cybersecurity-Related Words by Year and 10-K Part/Item

Location of Cybersecurity Disclosures by Relative Position

The heat map is a good illustration of sector averages but it doesn’t fully convey the “location” story. Two things can dramatically affect the relative position of cybersecurity text within a document, and thus its conspicuity.

First, section lengths vary widely by company. A complex utility could take many pages of Item 1 (Business) to get to Item 1A (Risks); one row apart in the heat map but sometimes half a document away. Second, a few companies re-arrange sections so they don’t appear sequentially; data in early heat map rows could actually be late in the document.

A better representation of location and conspicuity is the relative position of cybersecurity text within filings, regardless of section and document length. Chart 2 plots the position of all words in each company’s 2017 10-K from start to end, with only cybersecurity-related text highlighted. For reference, heat map Items 1 and 1A are in Chart 2’s 10-25% range.
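The idea of relative position can be sketched in a few lines of Python. The snippet below is an illustration under simple assumptions (whitespace tokens, a hypothetical keyword set), not the code used for Chart 2: it maps each matching token to a value between 0 (start of document) and 1 (end of document).

```python
def relative_positions(tokens, keywords):
    """Return the 0-to-1 document position of every token found in `keywords`."""
    n = len(tokens)
    if n < 2:
        return [0.0 for tok in tokens if tok.lower() in keywords]
    return [
        i / (n - 1)
        for i, tok in enumerate(tokens)
        if tok.lower() in keywords
    ]


# Hypothetical keyword set and a toy "document".
keywords = {"cyber", "cybersecurity"}
tokens = "business risks cyber governance financials".split()
print(relative_positions(tokens, keywords))
```

Plotting these values as tick marks on one horizontal line per company gives a lexical dispersion plot of the kind described above.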

Chart 2: Location of Cybersecurity Disclosures by Position

Lexical Dispersion (Relative Position) of Cybersecurity Text in 2017 10-Ks

Wording of Cybersecurity Disclosures

I continue the analysis of disclosure wording in the next section but for now I’m just going for a cursory, high-level look at cybersecurity wording. Specifically, I’m interested in two related things which provide insight into the subject matter of cybersecurity disclosure text. In increasing level of detail they are:

Keywords: The most frequent keywords, which offer an initial glimpse at overall subject matter
Subjects: Keyword co-occurrences and clusters of co-occurrences, which provide further detail

Most Frequent Keywords in Cybersecurity Disclosures

Word frequencies alone don’t tell us a lot, but they’re a good start for a section on semantics since they give us, in a glance, a “feel” for the overall subject matter before diving into more detailed analyses.

I won’t subject you to a word cloud, just Chart 3 below instead. It shows the 30 most frequent words in cybersecurity-related disclosure text, using a keyword extraction algorithm based on contiguity of words. The RAKE Score on the x-axis is a ratio of degree (how often it co-occurs with other keywords) to frequency.
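For readers unfamiliar with RAKE, the word-level score can be sketched as follows. This is a simplified, standard-library illustration of the degree-to-frequency ratio, assuming a tiny hand-picked stopword list; it is not the implementation used for Chart 3. Candidate phrases are runs of contiguous non-stopwords, a word's frequency is the number of times it appears in candidate phrases, and its degree is the summed length of the phrases it appears in.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "could", "in", "is", "of", "on", "or", "the", "to"}


def candidate_phrases(text):
    """Split text into runs of contiguous non-stopwords, breaking on punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_word_scores(text):
    """Score each word as degree / frequency over the candidate phrases."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidate_phrases(text):
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # counts the word itself plus its neighbors
    return {word: degree[word] / freq[word] for word in freq}


sample = "Cyber attacks on control systems could disrupt operations. Cyber attacks are increasing."
scores = rake_word_scores(sample)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Words that tend to appear inside longer phrases score higher than words that mostly stand alone, which is why the score favors terms like "situational awareness" over isolated words.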

Chart 3: Most Frequent Keywords in Cybersecurity Disclosures

Word Frequency Using Rapid Automatic Keyword Extraction (RAKE) Algorithm

One more chart on simple keyword frequencies, Chart 4 below, shows the most significant keyword changes from 2013 to 2017 (2017 compared to 2013, not each year in between). Being familiar with the underlying data I can tell you that the 2017 words are associated with slightly less generic cybersecurity wording than 2013’s. For example:

The 58.9% spike in “cybersecurity” (including all of its variants) and 30.8% drop in “cyberattack” (and variants) reflects an overall shift to more proactive wording in 2017 from a more passive, risk-oriented stance in 2013.
The increase in “awareness” and “situational” can be partially ascribed to specific uses of “situational awareness” in the context of the NIST Cybersecurity Framework and NERC Critical Infrastructure Protection (CIP) Plan.
Words such as “electromagnetic” (as in “electromagnetic pulse attacks”) and “PayPal” are further examples of greater specificity, the former of a potential attack vector and the latter referring to a specific data breach (PG&E, 2017).

Chart 4: Change in Cybersecurity Disclosure Keywords

Twenty-Five Largest Keyword Frequency Differentials Between 2013 and 2017

Thematic/Topical Content of Cybersecurity Disclosures

To better illustrate the kind of word relationships referred to above (e.g., “situational” and “awareness”) — and in turn better illustrate topical content — I made a word network graph (Chart 5 below). Also called a word similarity graph, it shows: 1) keyword relationships, and 2) topic categories. I know it’s busy; it’s a network of every single word, in every 2017 10-K. Don’t overthink it or try to discern every word and line, just get an overall feel for themes and relationships.

The chart is based on word co-occurrence (cosine-similarity-normalized co-occurrence, to be exact), a measurement of which words or lemmas appear in close proximity to each other, and do so frequently. High co-occurrence is indicated by word proximity (e.g., all those overlapping words) and the density of connecting lines. Topic categories are determined by statistical keyword clusters and are indicated by color.
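One common way to compute a cosine-normalized co-occurrence is sketched below. It assumes each "context" is a list of words (for example one sentence) and divides the number of contexts two words share by the square root of the product of their individual context counts. The context definition and function name are assumptions for illustration; the exact settings behind Chart 5 are not given here.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_cooccurrence(contexts):
    """Return {(word_a, word_b): cosine-normalized co-occurrence} over contexts."""
    word_count = Counter()  # number of contexts containing each word
    pair_count = Counter()  # number of contexts containing both words of a pair
    for context in contexts:
        words = sorted(set(context))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return {
        pair: shared / sqrt(word_count[pair[0]] * word_count[pair[1]])
        for pair, shared in pair_count.items()
    }


# Toy contexts (not real filing text).
contexts = [
    ["situational", "awareness"],
    ["situational", "awareness", "nerc"],
    ["nerc", "standards"],
]
sims = cosine_cooccurrence(contexts)
print(sims)
```

A value of 1.0 means two words always appear together; values near 0 mean they rarely share a context relative to how often each appears.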

Chart 5: Themes and Relationships in Cybersecurity Disclosures

Keyword Co-Occurrences and Topic Clustering in 2017 10-K Filings

Uniqueness of Cybersecurity Disclosures

Form 10-Ks aren’t the place for exhaustive detail on cybersecurity and certain sections call for generic wording. For example, Item 1A (Risk Factors) is to include only a brief listing of risks; details should appear in other sections. Nonetheless, to be of value to investors, the aggregate of a company’s disclosures needs to be detailed and company-specific. I could spend an entire article on this subject but for now I looked at document similarity and lexical diversity.

Similarity of Disclosures Across Companies

Chart 6 shows a hierarchical clustering of cybersecurity-related text. Like many charts in this article, it’s intended to tell a high-level story at a glance. The key point is that there is perceptible dissimilarity (uniqueness) within the sector. That’s good. However, there are obvious clusters of high document similarity among companies that are themselves quite different. In other words, it’s highly-generic cybersecurity language that produces high cosine similarity metrics.
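For intuition on the cosine similarity metric mentioned above, here is a minimal bag-of-words sketch in plain Python. It assumes raw word counts and whitespace tokenization, which is cruder than what a real analysis would use (weighting, lemmatization), and it stops short of the hierarchical clustering step, which would take the resulting pairwise similarities as input.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy disclosures (invented sentences, not quotes from filings).
generic_1 = "cyber attacks could adversely affect our business"
generic_2 = "cyber attacks could adversely affect our operations"
specific = "we follow the nist framework and report to the board quarterly"

print(cosine_similarity(generic_1, generic_2))
print(cosine_similarity(generic_1, specific))
```

Two boilerplate risk statements that differ by a single word score close to 1, while a company-specific statement shares almost no vocabulary with them, which is the pattern the clusters in Chart 6 reflect.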

Chart 6: Similarity of Cybersecurity Disclosure Wording

Hierarchical Clustering of Cybersecurity-Related Text in 2017 10-K Filings

Diversity of Wording Within Disclosures

Whereas Chart 6 indicates levels of similarity/difference across companies, Chart 7 below shows lexical diversity within each company’s filings. Higher lexical diversity equates to more varied wording, which implies a greater level of detail.
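The article does not say which lexical diversity measure underlies Chart 7. As an assumption for illustration, the sketch below uses the simplest one, the type-token ratio: the number of distinct words divided by the total number of words.

```python
import re


def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


print(type_token_ratio("risk risk risk risk"))
print(type_token_ratio("cyber risk management plan"))
```

One caveat with this measure is that it falls as texts get longer, so comparisons across filings of very different lengths usually call for a length-corrected variant.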

Chart 7: Diversity of Cybersecurity Disclosure Wording

Lexical Diversity of Cybersecurity-Related Text in 2017 10-K Filings

Relationships With Other Relevant Variables

This article’s focus is text analysis of cybersecurity disclosures. However, given my central tenets — cybersecurity is a major SRI/ESG investing issue and utility cybersecurity/grid reliability is a universal concern — I conclude with a quick look at relationships between cybersecurity disclosure and: 1) ESG scores and 2) functional roles in grid reliability.

Cybersecurity Disclosure and Sustainability/ESG Performance

To paraphrase my article Cybersecurity & ESG Investing, the current cybersecurity/ESG dialogue focuses heavily on Governance. This is a fair near-term prioritization since cyber risk management is a Governance issue and attack prevention is the first priority. Cybersecurity impacts all aspects of ESG, however, and utility cyber risks in particular can have major Environmental and Social consequences.

Cybersecurity disclosure and ESG scores are scaled to a relative 100-point scale. Disclosure scores are based on word density. ESG scores are the average of two data sources: 1) CSRHub and 2) Sustainalytics (as published on Yahoo Finance’s sustainability page). The number of data points exceeds the number of utilities because each utility’s E, S, and G scores are plotted separately: fifty utilities times three scores.
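The scaling method is not specified. One plausible reading, shown here purely as an assumption, is min-max scaling, where the lowest value in the group maps to 0 and the highest to 100.

```python
def scale_to_100(values):
    """Min-max scale a list of numbers onto a 0-100 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: place everyone at the midpoint.
        return [50.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


# Toy word-density values (not real data).
print(scale_to_100([2, 4, 6]))
```

A percentile rank would be another reasonable choice and would be less sensitive to a single outlier.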

Chart 8 shows the relationship — or more accurately the lack thereof — between cybersecurity disclosure and ESG. If there were a material relationship the upper right quadrant would be the most populated. It’s the opposite; only 18.4% of scores fall into this corner. It’s worth noting that half of the upper right scores are Governance, i.e. there is some degree of relationship between good governance and cybersecurity disclosure.

Chart 8: Cybersecurity Disclosure and ESG Performance

Cybersecurity Disclosure Density and Sustainability/ESG Scores

Cybersecurity Disclosure and Roles in Grid Reliability

The second tenet is that utility cybersecurity is relevant to all investors, regardless of ownership, because electricity is a universal input. It’s the low-probability/high-impact scenarios — cascading and prolonged power outages — that are of universal concern. So I figured it’s worth looking at cybersecurity disclosure relative to roles in grid reliability.

Chart 9 includes some subsidiaries of the 50 utilities analyzed throughout this article and not all of those 50 appear in the chart. Entities plotted are those registered with NERC as bulk-power system users, owners, and operators responsible for specified reliability functions. The underlying measure of relative influence (e.g., megawatts, transmission lines) depends on function.

Chart 9 plots cybersecurity disclosure (the same y-axis measure as in Chart 8) and a measure of relative influence in grid reliability. Colors indicate functional role, defined below, in maintaining bulk-power system reliability. Several influential entities in important reliability roles perform well in cybersecurity disclosure but otherwise there doesn’t appear to be much of a story, at least not at first glance. This is another area ripe for further, future analysis.

Balancing Authority: Integrates resource plans, maintains load-interchange-generation balance
Generation Owner: Owns and maintains electricity generating plants that support system reliability
Planning Authority: Coordinates electricity transmission systems, service plans, and protection systems
Resource Planner: Develops long-term plans for resource/load adequacy within a Planning Authority area
Transmission Owner: Owns and maintains electricity transmission assets that support system reliability
Transmission Planner: Develops long-term plans for interconnected electricity transmission systems

Chart 9: Cybersecurity Disclosure and Reliability Functions

Cybersecurity Disclosure Density and NERC Bulk-Power System Reliability Function

IV. Methods

I analyzed the text contained in Form 10-K filings by electric utilities over the last five fiscal years (2013-2017), using various natural language processing (NLP) techniques to parse, classify, and analyze cybersecurity content. I won’t bore you with details but a few quick points about universe, data, and methodology are useful.

Company Universe

My universe is 50 publicly-traded “electric utilities” — pure electric utilities, diversified electric utilities (electric-leaning with some natural gas), electricity transmission (one company, ITC Holdings, acquired by Fortis in 2016), and power generators (pure independent generators and generation-leaning diversified utilities).

Foreign companies in my analysis are those with major investments in U.S. electricity infrastructure and U.S.-listed equity or ADR. These are National Grid (U.K.) and three Canadian companies: Algonquin Power, Emera, and Fortis. I exclude a few foreign companies with only minor investments in U.S. electricity assets.

Data/Text Sources

To state the obvious, the source of text data for all analyses in this article is company 10-K filings, downloaded from the SEC’s EDGAR system. As a general rule I analyzed parent-level 10-K filings (my focus is on equity investing). There are two exceptions to that rule: 1) foreign companies, and 2) Berkshire Hathaway Energy.

Foreign Companies

U.S.-domiciled companies file a 10-K but foreign companies file different forms at the parent level. Specifically, the three Canadian companies mentioned above file a 40-F and National Grid files a 20-F. Although 40-F sections roughly mirror the structure of the 10-K, it’s not perfect and the 20-F is altogether different. Thus, for foreign owners I analyzed subsidiary-level 10-Ks. For example, for Fortis I analyzed the 10-Ks of ITC Holdings, Tucson Electric, and UNS Energy.

Berkshire Hathaway

I did the same for Berkshire Hathaway albeit for a different reason. Berkshire Hathaway Inc. (NYSE: BRK.A) files a parent-level 10-K but the holding company is just too fundamentally different from its utility investments to make any text analysis of its 10-K relevant and comparable for this article. Instead I analyzed individual 10-K filings for each of Berkshire Hathaway Energy’s holdings: MidAmerican Energy, Nevada Power, PacifiCorp, and Sierra Pacific Power.

Analytical Process

This section details the analytical process up to Step 5 in the list below (the primary text analysis is already discussed throughout the “Results” section above).

Data Acquisition: The first step was to acquire the data by downloading filing documents from the SEC’s EDGAR system.
Cleansing/Munging: This labor-intensive step wrangles messy raw text into a format suitable for downstream analysis.
Pre-Processing: Pre-processing is related to cleansing but focuses on higher level data transformations specific to NLP.
Classification: This binary classification step indicates whether text is “about cybersecurity” or “not about cybersecurity.”
Primary Analysis: The primary analysis of cybersecurity-related 10-K text as discussed throughout the prior section.
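To give a feel for the classification step, the sketch below labels a paragraph with a simple keyword rule. This is a stand-in chosen for brevity, with a hypothetical term list; the classifier actually used for this analysis is not described in this section and may well be statistical rather than rule-based.

```python
import re

# Hypothetical term list; a real one would be broader and curated by hand.
CYBER_PATTERN = re.compile(
    r"\b(cyber\w*|hack\w*|malware|ransomware|phishing|data breach\w*)\b",
    re.IGNORECASE,
)


def is_about_cybersecurity(paragraph):
    """Return True if the paragraph contains any cybersecurity-related term."""
    return bool(CYBER_PATTERN.search(paragraph))


print(is_about_cybersecurity("We face the risk of cyber attacks on our control systems."))
print(is_about_cybersecurity("Fuel costs rose due to colder weather."))
```

A keyword rule is transparent and easy to audit, but it misses paraphrases and can flag incidental mentions, which is the usual argument for a trained classifier.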

Data Acquisition

EDGAR offers documents in HTML, plain text, and XML formats. Only the first two are relevant to text analysis and I used the HTML version since it has what I need without much extra. The plain-text version is a dump of every format, HTML included, into a single file. It has uses but not for this analysis; it’s redundant and unnecessarily taxing on resources.

The XML format is specific to XBRL instances of 10-K submissions. XBRL (highly-structured financial and other quantitative data) is a godsend for traditional kinds of analysis, but it contains little text and is of limited use for NLP. As a side note, early adopters are using a new hybrid format called in-line XBRL (iXBRL) which combines HTML and XBRL into a single document.

The data acquisition process itself was largely automated with scripts I wrote to download filings and perform checks (e.g., file sizes) and inventories (companies and years). Some manual preparation went into creating a predetermined list of CIKs, the unique identifiers for SEC filers which are used to fetch documents from EDGAR.
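As a small illustration of working from a curated CIK list, the sketch below builds the address of EDGAR's JSON filing index for a company and checks a download inventory for gaps. It makes no network calls. The CIK shown is a placeholder, the helper names are hypothetical, and the `data.sec.gov/submissions` endpoint is the SEC's current public interface, which is not necessarily what the original scripts used.

```python
def edgar_submissions_url(cik):
    """Build the EDGAR submissions-index URL; CIKs are zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def missing_filings(downloaded, companies, years):
    """Return (company, year) pairs expected but absent from the download inventory."""
    return [
        (company, year)
        for company in companies
        for year in years
        if (company, year) not in downloaded
    ]


# Placeholder CIK and toy inventory (not real identifiers or results).
print(edgar_submissions_url("1234567"))
gaps = missing_filings(
    downloaded={("Utility A", 2016), ("Utility A", 2017), ("Utility B", 2017)},
    companies=["Utility A", "Utility B"],
    years=[2016, 2017],
)
print(gaps)
```

Note that the SEC asks automated clients to identify themselves with a descriptive User-Agent header and to respect its published rate limits.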

In theory I could fully automate the process using SIC codes. Instead of manually curating CIKs I could use a script to collect those associated with each SIC code under the “utilities” umbrella. This could work for a multi-sector analysis but here I relied on my sector knowledge; a bit of informed manual intervention is good for analytical integrity since:

SIC codes aren’t perfect. Some company classifications are stale and would be different if re-classified based on current operations. Moreover the codes themselves (e.g., “Electric Services”) aren’t reflective of today’s differentiated sector.
CIK codes aren’t perfect. There’s nothing wrong with CIKs per se but rather how they’re used in special situations. For example a merger could, depending on timing, create a brief window where most recent filings are under old CIK codes.

Cleansing/Munging

It’s said data science is 80% cleansing, 20% analysis. This feels about right but it’s 80% well-spent because “garbage in, garbage out.” Believe me there’s a lot of garbage in 10-K filings, made worse by rampant inconsistencies in formatting. The first of two data preparation stages, this step entails: 1) cleansing messy text, and 2) parsing it into 10-K sections.

The messiness lies in the underlying markup and is a problem only for text analysis; the documents themselves are perfectly readable when viewed. Disorderly markup has many causes. Foremost, SEC rules apply to substance not style (as they should). Also, companies use different software to prepare filings. Some do it manually. Text is often copied/pasted from glossy annual reports, formatting characters and all.

Cleansing Messy Text

Before dealing with HTML structure, the actual text was cleaned and standardized to create a clean but still “original” copy. In its raw state 10-K text is a rat’s nest of different character encodings and sets (ASCII, UTF-8, Unicode, HTML entities), crude formatting hacks, and pesky metadata (e.g., page numbers). Again, this doesn’t affect reading but is problematic for text analysis. For example, the three items below look the same but underneath they’re not.

Item 1A: “Risk Factors”
Item 1A: “Risk Factors”
Item 1A: “Risk Factors”
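To make the point concrete, here is a small sketch of how look-alike strings can be collapsed to one canonical form. The three variants are my own illustrative guesses at what such differences look like in raw filings (curly quotes, HTML entities, non-breaking spaces), and the function name and quote mapping are assumptions.

```python
import html
import unicodedata

# Map typographic quotes to plain ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
})

def standardize(text):
    """Collapse encoding variants of the same visible text into one form."""
    text = html.unescape(text)                   # &ldquo; and &#8220; become a curly quote
    text = unicodedata.normalize("NFKC", text)   # non-breaking space becomes a plain space
    text = text.translate(QUOTE_MAP)             # curly quotes become straight quotes
    return " ".join(text.split())                # collapse runs of whitespace

# Illustrative variants: they render alike but differ byte for byte.
variants = [
    "Item 1A: \u201cRisk Factors\u201d",
    "Item&nbsp;1A: &ldquo;Risk Factors&rdquo;",
    "Item\u00a01A:  &#8220;Risk Factors&#8221;",
]
cleaned = {standardize(v) for v in variants}
```

After standardization the set `cleaned` holds a single string, so downstream matching on section headings no longer depends on how a given filer encoded them.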

Parsing Into Sections

Although I analyzed other things my main interest is cybersecurity disclosure vis-à-vis 10-K sections. So a major step was parsing highly-unstructured text, contained in a semi-structured format (HTML), into 10-K Parts and Items. After cleansing, sections were relatively easy to determine programmatically, with a little manual help when needed (usually when sections appeared out of order).
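A rough sketch of how Item headings might be located in cleansed text is below. The regular expression, the function name, and the "keep the longest body" rule for skipping table-of-contents entries are assumptions for illustration; real filings need more exceptions than this handles.

```python
import re

# Matches headings such as "Item 1.", "ITEM 1A:", "Item 7A." at the start of a line.
ITEM_RE = re.compile(r"^\s*item\s+(\d{1,2}[ab]?)\s*[.:]", re.IGNORECASE | re.MULTILINE)

def split_into_items(text):
    """Return {item_label: body_text} for each Item heading found in text."""
    matches = list(ITEM_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        label = match.group(1).upper()
        body = text[match.end():end].strip()
        # A label can appear twice (table of contents, then the real section);
        # keep the longer body on the assumption that it is the real one.
        if len(body) > len(sections.get(label, "")):
            sections[label] = body
    return sections

sample = """Item 1. Business
We generate and distribute electricity.
Item 1A. Risk Factors
Cyber attacks could disrupt our operations.
Item 2. Properties
We own several plants.
"""
sections = split_into_items(sample)
```

Sections that appear out of order still get the right label with this approach, since each body is keyed by the heading that precedes it rather than by position.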

Pre-Processing

Additional pre-analysis processing (hence “pre-processing”) is needed. Some is further cleansing but core tasks are more substantive transformations fundamental to NLP. The main steps I performed fall into three categories: 1) organizing documents for analysis, 2) distilling text to its essence, and 3) annotating text with linguistic information.

Organizing Text Into Corpora

Thus far I’ve moved unstructured text from HTML documents into data frames. Better, but not best. Here I aggregate data frames into corpora, the organizational backbone of computational linguistics. A corpus is a structured collection of related texts that contains linguistic information: lexical, morphosyntactic, semantic, and pragmatic. Each corpus (one per company) contains the original cleansed text for reference plus a copy to undergo additional processing.

Distilling Text to Its Essence

At this point I reduce the text down to its semantic core for more informative analyses. I start with basic housekeeping such as stripping out a variety of non-relevant text like phone numbers, website URLs, and the like. The most important part of this process, however, is stop word removal. Stop words are of little (even negative) value to linguistic analysis and include extremely common generic words (“and,” “but,” “the”) as well as project-specific words, acronyms, etc.
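The distillation step can be sketched as below. The stop word lists are tiny hypothetical samples (real lists run to hundreds of entries), and the URL and phone patterns are simplified assumptions.

```python
import re

# Hypothetical, heavily abbreviated stop word lists.
GENERIC_STOP_WORDS = {"and", "but", "the", "a", "of", "to", "or", "at", "us", "about"}
PROJECT_STOP_WORDS = {"company", "form", "10-k"}
STOP_WORDS = GENERIC_STOP_WORDS | PROJECT_STOP_WORDS

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b")   # U.S.-style numbers only
WORD_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def distill(text):
    """Strip URLs and phone numbers, then drop stop words; returns a list of words."""
    text = URL_RE.sub(" ", text.lower())
    text = PHONE_RE.sub(" ", text)
    return [word for word in WORD_RE.findall(text) if word not in STOP_WORDS]

raw = ("Contact us at www.example.com or (800) 555-0100 about "
       "the cybersecurity program of the Company.")
distilled = distill(raw)
```

Keeping generic and project-specific lists separate makes it easy to reuse the generic list elsewhere while tuning the project list per analysis.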

Annotating Text with Meta Info

Text annotation is a standard NLP term, but the word “annotation” understates the depth and importance of the process. In short, text in an annotated corpus is tagged with metadata that provides key information as described above (lexical, semantic, etc.). Annotation provides relevancy context, without which we’d just have a bunch of text. The three main aspects to my annotation process for this project are:

Tokenization: Tokenization breaks text into successively more granular units (document, paragraph, sentence, word). Tokens, the semantic units used for linguistic analysis, are often words but not always (mine are lemmas, as described below).
Lemmatization: Lemmatization is a text normalization method that converts different words with fundamentally the same meaning to a single form — e.g., “run,” “runs,” and “ran” share a lexeme and are converted to the lemma “run.”
POS Tagging: Parts-of-speech (POS) tagging assigns parts of speech to each token. This can be as simple as noun, verb, adjective, etc. for basic analyses, or more fine-tuned such as verb (transitive), verb (intransitive), interjection.
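As a toy illustration of the first two steps, the sketch below tokenizes a document down to words and attaches a lemma to each. The lookup table is a hypothetical stand-in for a real lemmatizer, the sentence splitter is deliberately naive, and POS tagging is left out because it needs a trained tagger.

```python
import re

# Hypothetical lookup table; a real lemmatizer uses a full dictionary plus POS context.
LEMMA_LOOKUP = {
    "runs": "run", "ran": "run", "running": "run",
    "attacks": "attack", "attacked": "attack",
}

def split_paragraphs(document):
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def split_sentences(paragraph):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_words(sentence):
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower())

def annotate(document):
    """Return one row per word, tagged with its paragraph index, sentence index, and lemma."""
    rows = []
    for p_idx, paragraph in enumerate(split_paragraphs(document)):
        for s_idx, sentence in enumerate(split_sentences(paragraph)):
            for word in split_words(sentence):
                rows.append({
                    "paragraph": p_idx,
                    "sentence": s_idx,
                    "word": word,
                    "lemma": LEMMA_LOOKUP.get(word, word),
                })
    return rows

doc = "Hackers ran attacks. The malware runs daily.\n\nOperators run drills."
rows = annotate(doc)
lemmas = [row["lemma"] for row in rows]
```

Here “ran,” “runs,” and “run” all end up as the lemma “run,” while each row keeps its paragraph and sentence position so results can be rolled back up later.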

Classification

I’m finally emerging from that “80% preparation” into the “20% analysis” (although technically this step is a little of both). The goal here is simple binary classification — i.e., whether or not text is about cybersecurity. I assign boolean (true/false) indicators to paragraph- and sentence-level tokens, and later extrapolate up to sections.

In almost-plain English, my hybrid approach used semi-supervised machine learning algorithms to do the heavy lifting and a keyword-based technique for fine tuning re: sector-specific cybersecurity terms, acronyms, organizations, etc. It involved curating keywords/topics, training classification models, performing classification and tweaking results.

Curating Keywords and Topics

Though by no means the crux of my analysis, I do use keywords and phrases for certain tasks in subsequent steps. I created an initial list from my knowledge of the space. To augment that list, I wrote a crawler to gather text from relevant sources (documents, websites) and extracted prominent keywords using an LDA topic model fit with the variational expectation-maximization (VEM) algorithm.

Training Machine Learning Model

Machine learning is often overkill; in many cases simpler keyword-based classification provides similar results without the overhead. For this case, there are too many sector-specific variations in vernacular to rely on specific words alone. To identify all instances of cybersecurity-related text I needed to first train a machine learning classification model.

By feeding the model enough manually pre-labeled text in two flavors — “cybersecurity” and “not cybersecurity” — it learns what cybersecurity text sounds like regardless of specific words. Once trained it can assign a “cybersecurity-ness” value to any new text. The “semi” in this semi-supervised method pertains to manual pre-labeling of training data. I used the most prototypical keywords curated earlier to extract and label training data/text.
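The idea of seeding labels with keywords and then training a model that generalizes beyond them can be sketched with a small naive Bayes classifier. To be clear, naive Bayes is a stand-in chosen for brevity; the algorithm actually used is not specified here, and the seed keywords, training sentences, and names are all hypothetical.

```python
import math
import re
from collections import Counter

SEED_KEYWORDS = {"cybersecurity", "malware", "ransomware", "phishing"}   # hypothetical

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def seed_label(text):
    """Label text as cybersecurity if it contains any prototypical keyword."""
    return bool(SEED_KEYWORDS & set(tokenize(text)))

class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def score(self, text):
        """Log-odds of the cybersecurity class; above zero means cybersecurity."""
        log_odds = math.log(self.doc_counts[True] / self.doc_counts[False])
        size = len(self.vocab)
        n_true = sum(self.word_counts[True].values())
        n_false = sum(self.word_counts[False].values())
        for word in tokenize(text):
            if word not in self.vocab:
                continue   # ignore words never seen in training
            p_true = (self.word_counts[True][word] + 1) / (n_true + size)
            p_false = (self.word_counts[False][word] + 1) / (n_false + size)
            log_odds += math.log(p_true / p_false)
        return log_odds

    def predict(self, text):
        return self.score(text) > 0

train = [
    "Malware could compromise our information systems and networks.",
    "A phishing campaign may expose customer data on our networks.",
    "Ransomware intrusions could disrupt grid control systems.",
    "Severe storms may damage transmission lines and substations.",
    "Fuel prices and interest rates affect our earnings.",
    "Regulators may not approve our requested rate increases.",
]
model = NaiveBayes().fit(train, [seed_label(t) for t in train])
unseen = "Unauthorized intrusions into our information systems could expose data."
```

The unseen sentence contains none of the seed keywords, yet the model scores it as cybersecurity because of the surrounding vocabulary it learned (“intrusions,” “systems,” “data”), which is the point of training a model rather than matching words.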

Main Binary Text Classification

I then used the classification model to label 10-K text as cybersecurity-related or not. As stated earlier I did this at two levels, sentence and paragraph, and in later analytical steps extrapolated upwards to Items and Parts. I ran sentences and paragraphs through the model — rather than just sentences and extrapolating up to paragraphs — since 10-K formatting can produce odd paragraph/sentence parsing results, typically with bulleted lists.
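Extrapolating from paragraph-level labels up to Items might look like the sketch below. The roll-up rule (an Item is flagged if at least one of its paragraphs is flagged) is my assumption for illustration, as are the function and field names.

```python
from collections import defaultdict

def roll_up(units):
    """Aggregate (item_label, is_cyber) pairs into per-Item counts and a flag."""
    counts = defaultdict(lambda: {"total": 0, "cyber": 0})
    for item, is_cyber in units:
        counts[item]["total"] += 1
        counts[item]["cyber"] += int(is_cyber)
    # Assumed rule: one cybersecurity paragraph is enough to flag the Item.
    return {
        item: {**tally, "flagged": tally["cyber"] > 0}
        for item, tally in counts.items()
    }

# Illustrative paragraph-level labels for one filing.
paragraph_labels = [("1", True), ("1A", True), ("1A", False), ("7", False)]
by_item = roll_up(paragraph_labels)
```

Keeping the counts alongside the flag makes it possible to distinguish an Item with a passing mention from one with sustained discussion.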

Sub-steps during training assured overall model accuracy but I did some checking and tweaking around a few particular keywords. For example, text in Item 1A (Risk Factors) was occasionally mis-classified as cybersecurity-related because it included enough wording along the lines of “attack,” “breach,” “malicious,” etc. even though they pertained to purely physical threats. My analysis included cyber-physical threats but there had to be a cyber element.
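A keyword-based correction of this kind could be sketched as follows. The two word lists and the function name are hypothetical; the rule simply vetoes a positive label when threat wording appears with no cyber element at all.

```python
import re

# Hypothetical lists: threat words that are ambiguous on their own,
# and cues that establish a cyber element.
AMBIGUOUS = {"attack", "attacks", "breach", "breaches", "malicious"}
CYBER_CUES = {"cyber", "cybersecurity", "network", "networks", "software",
              "malware", "hacker", "hackers", "data"}

def words_in(text):
    # Splitting on non-letters means "cyber-physical" yields "cyber" and "physical".
    return set(re.findall(r"[a-z]+", text.lower()))

def refine(text, model_says_cyber):
    """Keep a positive label only if the text has some cyber element."""
    if not model_says_cyber:
        return False
    words = words_in(text)
    if (words & AMBIGUOUS) and not (words & CYBER_CUES):
        return False   # threat language, but purely physical
    return True

physical = "Physical attacks or malicious acts of sabotage could damage our substations."
hybrid = "A cyber-physical attack could breach plant control systems."
```

The rule only ever downgrades a label, so it cannot introduce new false positives; the trade-off is that its reach depends entirely on how complete the cue list is.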

The Form 10-K is an annual report required of U.S. companies with assets greater than $10 million and a class of equity securities (publicly or privately traded) held by more than 2,000 owners. This filing requirement also includes quarterly 10-Q and other periodic reports (e.g., 8-K). The content of 10-K filings is often similar to that of companies’ “glossy” annual reports, but SEC filings are more detailed and must adhere to strict rules governing topics, structure, and language.
See Commission Statement and Guidance on Public Company Cybersecurity Disclosures for the SEC’s 2018 guidance, which is based on prior guidance issued by its Division of Corporation Finance in Disclosure Guidance: Cybersecurity and Division of Investment Management in Cybersecurity Guidance. The SEC’s 2018 guidance expands on statements by its divisions and addresses two new topics: 1) cybersecurity policies and procedures, and 2) insider trading in the cybersecurity context.
See Electronic Distribution of Business Reporting Information. The Financial Accounting Standards Board (FASB) is a private non-profit organization that governs the Generally Accepted Accounting Principles (GAAP) used by companies in the United States.
See SASB Industry Standards: A Field Guide for general background and SASB’s downloads page for details on its industry-specific standards.
See the DHS’s National Infrastructure Protection Plan for general information and Critical Infrastructure Sectors for sector-specific information.
DHS’s Energy sector is all-inclusive but in the financial world Energy and Utilities are distinct sectors within which are industries. Electric utilities and independent power generators are separate industries in the Utilities sector but for simplicity’s sake I’ll refer to them collectively as “utilities.”
There is a lot of overlap among those categories. For example, all constitute cyber crime. Cyber hacktivists often use cyber vandalism (e.g., website defacements) as a tool. Cyber terrorism is sometimes intertwined with nation-state agendas.
A cyber-physical system is one where physical and electronic components are so inextricably intertwined that the system wouldn’t function without one or the other. Typical examples include power plants, pipelines, and water treatment facilities but also include less industrial systems such as medical monitoring.
See Section 2(e): Assessment of Electricity Disruption Incident Response Capabilities of Executive Order 13800 Strengthening the Cybersecurity of Federal Networks and Critical Infrastructure.
See Worldwide Threat Assessment of the U.S. Intelligence Community.

About the Author

I am a sustainability analyst, author, and consultant. My focus is SRI/ESG investing and sustainable finance. I combine subject knowledge with a hacker mindset and eclectic technology stack to uncover original ESG insights along roads less traveled. A core area of expertise is energy sustainability and related environmental and technology issues (e.g., grid cybersecurity). My experience includes running a global Equity Research practice at a Fortune 500 investment bank and founding an ESG investment research firm. I'm intensely inquisitive and obsessed with coding so be forewarned of my occasional 'experimental ESG' posts. me@kylerudden.com