Scraping Pro Football Reference

29 Jan 2025 • Maxwell McNeal

Introduction

As an avid sports fan and data scientist, I’ve always liked looking into the statistics behind sports. And given the very public nature of NFL games and their corresponding box scores and statistics, I was quite surprised that I wasn’t easily able to track down a good dataset for the NFL statistics that I was interested in. Specifically, I was looking for career statistics for all players, past and present, in the NFL. I knew the data was out there and had been organized by various websites, most notably Pro Football Reference, but I couldn’t seem to find anything that matched exactly what I wanted. The closest I found was a dataset, seen here, from Zack Thoutt. However, it was 7 years out of date, lacked all the columns I wanted, and the scraper found on his github profile no longer works due to some changes to how the PFR website handles its tables.

So, I decided to write a simple web scraper for PFR that would get me the data I wanted. The code for the web scraper and the resulting data can be found at this github repo. Big shoutout to Zach Thoutt for his original scraper, seen here, as I referenced it to get the foundation of my scraper.

In this blog post, I’m going to go over the resulting data from the scraper, clean it up a bit, and show some simple analysis and visualizations as examples of uses of the dataset.

Importing and Cleaning the Dataset

import numpy as np
import pandas as pd
import seaborn as sns

player_stats = pd.read_csv('player_stats.csv')

Given the often tenuous nature of web scraping, the data is likely to contain some missing values. This code tells me how many missing values each column has.

player_stats.isnull().sum().loc[player_stats.isnull().sum() > 0]

# Code Output
position 10 
height 259 
weight 259 
dtype: int64

I can see that we have 3 columns with missing values, position, height, and weight. I plan on doing analysis later using position, and it only has 10 missing values, so let’s see if I can fix these values. First, I want to take a look at the effected rows.

missing_position = player_stats[player_stats['position'].isnull()]
missing_position.iloc[:, :9]

	player_id	name	position	career_begin	career_end	active	height	weight	games_reg
661	662	Daniel Archibong	nan	2021	2021	False	6-6	300lb	2
2766	2767	Josiah Bronson	nan	2021	2022	False	6-3	295lb	8
5229	5230	Tyler Coyle	nan	2021	2022	False	6-1	210lb	3
5403	5404	Tae Crowder	nan	2020	2023	True	6-3	235lb	43
5865	5866	Michael Davis	nan	2017	2024	True	6-2	196lb	122
10062	10063	JaQuan Hardy	nan	2021	2021	False	5-10	225lb	3
10797	10798	Grant Hermanns	nan	2022	2022	False	6-7	305lb	2
16258	16259	Nate McCrary	nan	2021	2021	False	6-0	213lb	1
21560	21561	Lenny Sachs	nan	1920	1926	False	nan	nan	0
22884	22885	Chris Smith	nan	2024	2024	True	6-1	305lb	5

Looking at all the players that are missing their position, there doesn’t seem to be any pattern as to why it’s missing. It is mostly players from recent years, but there is a player from the 1920’s. There’s a mix of active and non-active players, some players have played quite a few games while others only played a couple, and only one of the ten is also missing their height and weight. Looking at the first player, Daniel Archibong, his personal Pro Football Reference page, seen here, shows that he was listed as a DT for the 2 games he played in Pittsburgh back in 2021. For some reason, his position isn’t listed in the parentheses following his name on the “Football Players with Last Names Starting with A” page, seen here. That’s why his position was not recorded by the scraper. Regardless, it is clear that Daniel Archibong was a DT, so his position can be updated as such. Given there are only 9 other players with missing positions, they can similarly be solved manually. All I have to do is update each of their positions with the following code:

player_stats.loc[player_stats['player_id'] == 662, 'position'] = 'DT'
player_stats.loc[player_stats['player_id'] == 2767, 'position'] = 'DT'
player_stats.loc[player_stats['player_id'] == 5230, 'position'] = 'S'
player_stats.loc[player_stats['player_id'] == 5404, 'position'] = 'LB'
player_stats.loc[player_stats['player_id'] == 5866, 'position'] = 'CB'
player_stats.loc[player_stats['player_id'] == 10063, 'position'] = 'RB'
player_stats.loc[player_stats['player_id'] == 10798, 'position'] = 'G'
player_stats.loc[player_stats['player_id'] == 16259, 'position'] = 'RB'
player_stats.loc[player_stats['player_id'] == 21561, 'position'] = 'LE'
player_stats.loc[player_stats['player_id'] == 22885, 'position'] = 'DL'

Now, I can quickly verify that there are no longer any missing values in position.

player_stats['position'].isnull().sum()

# Code Output
0

As for height and weight, each column is missing 259 values. I scraped both columns together with a single regex expression, which gives me the sense that all the missing values for these columns come from pages where my regex failed to find a match. This would mean that all the missing values for both columns are in the same 259 rows. I’m going to take a peek at the first 10 rows that are missing height to double check.

missing_height = player_stats[player_stats['height'].isnull()]
missing_height.iloc[:10, :9]

	player_id	name	position	career_begin	career_end	active	height	weight	games_reg
460	461	Anderson	WB	1922	1922	False	nan	nan	1
582	583	Jaby Andrews	WB-DB	1934	1934	False	nan	nan	2
782	783	Burl Atcheson	E	1922	1922	False	nan	nan	1
923	924	Backnor	C	1921	1921	False	nan	nan	1
1103	1104	Hugh Bancroft	E	1923	1923	False	nan	nan	2
1283	1284	Eddie Barnikow	FB	1926	1926	False	nan	nan	2
1326	1327	John Barsha	FB	1920	1920	False	nan	nan	3
1485	1486	Fred Beach	G	1926	1926	False	nan	nan	1
1564	1565	John Beckwith	HB	1920	1920	False	nan	nan	3
1776	1777	Eddie Benz	E	1922	1922	False	nan	nan	1

It certainly seems to be the case that every row that is missing height is also missing weight. This next code snippet can confirm this.

missing_height_and_weight = missing_height['player_id'] == player_stats[player_stats['weight'].isnull()]['player_id']
print(missing_height_and_weight.unique())

# Code Output
[ True]

Let’s look a little further into this to see if this is a problem with my scraper or a problem with the underlying data. One strange thing I immediately notice is in the first row. The player has only one name, “Anderson”, and played a single game in 1922. Their individual Pro Football Reference page, seen here, shows very little data, only that they played in one game. No full name, no statistics, no height or weight. It seems the scraper did all it could here as this page offers little value, and it should probably be removed from the data set. We can see what looks like a similar case in the fourth row, a played named only “Backnor” with one career game in the 1920’s as well. This looks like an unrelated problem, so let’s revisit this after we solve the height and weight situation.

Moving on to the second row, on Jaby Andrew’s PFR page, seen here, Jaby Andrews has a listed weight of 208lbs, but no listed height. However, the data shows both Jaby’s height and weight as missing, instead of just the height. This is because the scraper is expecting to find a height and weight regex match in the form of “X-Y, Zlb”, where X, Y, and Z are numbers. When a page is missing the height, like Jaby Andrew’s, or weight, the regex entirely fails to find a match, leaving both height and weight as NaN. So this is clearly an issue with the scraper. I could solve this rather easily by performing two separate regex matches to find height and weight independently. This would give us more complete data, but each of these 259 rows would still be missing either height or weight. So in this case, I’m going to choose to just drop all 259 columns to completely remove all missing values from the data. ¹

player_stats = player_stats.dropna(subset=['height', 'weight'])

Now we can revisit the earlier issue of rows with players like “Anderson” and “Backnor”. While those two specific instances were removed due to their missing height and weight, there will likely be more rows with a similar problem. It seems to me that the easiest way to find all instances of this problem would be to split the name column on every space, which will give us a list, then take the length of each list that results from the split. Any players whose name list is of length 1 is likely a problem.

player_stats['name_list_len'] = player_stats['name'].apply(lambda x: len(x.split()))
one_name = player_stats[player_stats['name_list_len'] == 1]
one_name.iloc[:10, :9]

	player_id	name	position	career_begin	career_end	active	height	weight	games_reg
3166	3167	Brunswick	G	1920	1920	False	5-10	182lb	1
20823	20824	Riley	G	1924	1924	False	5-11	195lb	1

Ah, another pair of players from the 1920’s who only played a single game apiece. I’m going to go ahead and drop these rows from the dataset as well.

A side note here is that the player_id I assigned to each player is simply their location in a list of all player names sorted alphabetically, where the first player has player_id = 1. I chose to start at 1 to avoid any player having a player_id of 0. However, when pandas gives indices to the rows in DataFrames, their indexing starts at 0. So “Brunswick” has a player_id of 3167, while their index in the DataFrame is 3166. This can cause off-by-one errors if not careful. I find it to typically not be an issue as long as I remembered to use the DataFrame indices when referring to specific rows, such as when dropping rows or using iloc.

player_stats.drop(3166, inplace=True)
player_stats.drop(20823, inplace=True)

print('DataFrame dimensions: ' + str(player_stats.shape))
print('Number of missing values: ' + str(player_stats.isna().sum().sum()))

# Code Output
DataFrame dimensions: (27695, 163) 
Number of missing values: 0

After cleaning the missing values and rows with only a single name from the data, the DataFrame contains 27695 rows and 163 columns.

Adding Additional Columns

I’m going to create some helper columns to make future analysis easier. The first one is going to be career_length, which will be the number of years the player has played in the league. The equation for this is career_length = career_end - career_start + 1. Adding the one is necessary to correct the off-by-one error. A player whose first year and last year in the NFL were both 2013 played one year, 2013, even though the difference between their first and last year is zero.

player_stats['career_length'] = player_stats['career_end'] - player_stats['career_begin'] + 1

The other column will be position_list. The current position column contains a string with the player’s position. This works great for players with only one listed position. However, for players with multiple listed positions, it contains each position separated by a hyphen in a single string, e.g. "DB-WR" for Deion Sanders. Performing a simple match for all players whose position == “QB” would exclude any players who have any other positions listed, even though they shouldn’t be. So I can just add a new column that splits the string in position on each hyphen and stores the resulting list. ²

player_stats['position_list'] = player_stats['position'].str.split('-')

Exploratory Data Analysis / Visualizations

With all of that done, now it’s quite trivial to perform quick exploratory data analysis on the data from the web scraper. Let’s say I want to see the average number of regular season games played by QBs across all of NFL history, and create a histogram to visualize that data. We can start by filtering the player data to only that of the players where at least one of their listed positions is “QB”.

qb_stats = player_stats[player_stats['position_list'].apply(lambda x: 'QB' in x)]
print(str(qb_stats.shape[0]) + ' QBs in the dataset')

# Code Output
1087 QBs in the dataset

We are left with 1087 players listed as a quarterback across all of NFL history. From here, it is easy to find summary statistics about regular season games played. And, using the python graphical package of your choice (in my case, seaborn), it’s very easy to quickly make visualizations, like histograms or scatterplots.

print(qb_stats['games_reg'].describe())
sns.histplot(qb_stats, x='games_reg', binwidth=10).set(title='Regular Season Games Played (QBs)')

# Code Output
count 1087.000000 
mean 46.971481 
std 57.329426 
min 0.000000 
25% 6.000000 
50% 22.000000 
75% 71.500000 
max 340.000000 
Name: games_reg, dtype: float64

The summary statistics show the mean number of regular season games played is 46.97, but the standard deviation, at 57.33, is huge relative to the mean, due to the large right skew seen in the histogram. The median of 22 is a better measure of central tendency when looking at the histogram. The histogram also illustrates that most quarterbacks appear in 10 regular season games or fewer, and as the number of career games increases, there are fewer and fewer QBs whose career spanned that many regular season games. This difference in games played causes most charts of counting statistics to look similar to the above chart, with a large bar in the start then a shrinking tail skewed right. More games means more opportunity to rack up numbers. For instance, let’s look at the histogram of career regular season passing yards.

sns.histplot(qb_stats, x='pass_yds_reg', binwidth=1000).set(title='QB Career Regular Season Passing Yards')

Due to the correlation between regular season games played and passing yards, the same trend appears in the chart, only even more exaggerated. Of the 1087 quarterbacks we started with, the chart shows that over 500 of them, or about half, never eclipse 1000 career regular season passing yards. While it’s hard to play a lot of games in the league as a quarterback, it seems to be even harder to rack up many thousands of passing yards. We can calculate the numerical correlation between career length and regular season passing yards, then use a scatterplot to visualize the relationship between both variables.

print(qb_stats['games_reg'].corr(qb_stats['pass_yds_reg']))
sns.scatterplot(qb_stats, x='games_reg', y='pass_yds_reg')

# Code Output
0.8869475559344511

These variables are strongly correlated, with an r value of about 0.88, and the scatterplot visually confirms this. However, this same trend doesn’t appear in histograms for statistics that are rates, such as completion percentage. Below is the corresponding histogram for regular season completion percentage.

sns.histplot(qb_stats, x='pass_cmp_pct_reg').set(title='QB Career Regular Season Completion Percentage')

Here see something much closer to a normal curve, with some outliers at both 0 and 100 percent for QBs that have very few passing attempts. Let’s see how different the chart looks if we institue a minimum number of passing attempts to appear on the chart. This is a common way to remove outliers in sports statistics. Pro Football Reference uses a minimum number of 1500 passing attempts to determine career leaders in passing completion, so I will use that as the cutoff.

sns.histplot(qb_stats[qb_stats['pass_att_reg'] >= 1500], x='pass_cmp_pct_reg', binwidth=1).set(title='QB Regular Season Completion Percentage (min. 1500 attempts)')

This step completely removes the outliers and shows a more representative picture of quarterback completion percentages.

Conclusion

Hopefully, this has helped illustrate the usefulness of the web scraper and dataset. I was able to extract a dataset of NFL career player statistics, clean it, and perform some basic exploratory data analysis. This dataset is useful for many types of sports analysis, whether it’s analyzing league-wide trends over time, crafting interesting visualizations, or even just creating sports trivia questions.

There is definitely room for the scraper and dataset to improve and grow, most obviously by scraping individual player seasons instead of career statistics. That additional level of granularity would open the door to analysis by age and comparing active players’ career trajectory to retired players, as well as team and season specific analysis that is currently not possible with the current dataset.

If you find this type of sports data science writing interesting, please keep checking in on the blog. One of my goals for 2025 is to regularly post interesting sports analytics stuff here. Also, if you have a github account and think the web scraper is cool or useful, feel free to drop a star on the repo. I really appreciate anyone who hasen taken the time to read all of this, thank you!

I’m also partially motivated to not update the scraper right this second due to the fact that the web scraping rate limit on PFR means that fully rerunning the scraper would take about 48 hours, and I do plan on finishing this blog post well before 48 hours from now. ↩
It’s technically stored in a pandas Series object, which behaves similarly to, but not exactly like, a base python list. In this situation, they are interchangeable. ↩