Project: The Baby Name Generator

Traversing the Social Security Administration's baby name database

Nov 02, 2021

As a newly minted 30-year-old, I am entering a new phase in my life: looking to settle down, put down roots, and some day make my dead grandparents happy by continuing the line.

Having worked with kids for many years, I found children’s names to be a frequent topic of discussion. A name could be too common (no one wants to be relegated to having their last initial tacked on to their name), it could be too long, it could be too weird: all pitfalls that parents should try their best to avoid when naming their children.

Luckily for us, the Social Security Administration keeps track of the country’s most frequent baby names in a given year, and allows users access to this data:

So at the very least, you can avoid the most frequent names from the last few years. So when your daughter Olivia goes on The Bachelor and there’s three other Olivias? That’s on you.

For our “phase 1” project at Flatiron we were assigned to create a single-page application that uses an external database (an “API”) to do something. After some deliberation, my partner and I landed on playing around with the government’s baby name database, as there’s some fun to be had there.

Here’s what we didn’t realize: The federal government is a plodding luddite. There is no organized “API” of the data—when you attempt to download it from the government, they present you with this:

A separate .txt file for each year going back to 1880.

The most popular girl names from the year 1880. A given year’s .txt file lists all girl names by order of popularity, then all boy names. One line of text per name. And exponential population growth means exponential txt file growth: “yob1880.txt” is **2,000** lines long, while “yob2020.txt” is **31,271**.

For our project, we landed on a simple “Baby Name Generator.” Users can input some search parameters (starting letter, desired length, desired time period) and get back a name that fits.

First draft of the page’s interface and the HTML. My partner refused to accept my assertion that “1995 Web is an aesthetic.”

A pretty simple idea, but this government data dump needed work before it was malleable. Before we could get to building a website in JavaScript / HTML, which was the actual assignment, there was a lot to do in Python to get the data formatted and uploaded to a server.

Warning: Here’s where things get exciting. But also incredibly dense.

Step 1: A function that opens the text file and converts each line into a text string, with each string as an individual item of an array.
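A minimal Python sketch of what a Step 1 function could look like. The function name `read_name_file` and the three-line stand-in file are placeholders for illustration, not the project's actual code, and apart from the Mary line the sample values are illustrative.

```python
import tempfile
from pathlib import Path


def read_name_file(path):
    """Return every non-empty line of a yobYYYY.txt file as a list of strings."""
    with open(path, encoding="utf-8") as handle:
        # splitlines() drops the trailing newline from each line
        return [line for line in handle.read().splitlines() if line.strip()]


# Demo on a tiny stand-in file so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "yob1880.txt"
    sample.write_text("Mary,F,7065\nAnna,F,2604\nJohn,M,9655\n", encoding="utf-8")
    demo_lines = read_name_file(sample)

print(demo_lines)
```

Each item in the returned list is still one raw line of text; pulling the name, sex, and count apart is left to the next step.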

Step 2: A function that takes each text string in our list and isolates the individual data points. “Mary, F, 7065” is now [“Mary”, “F”, 7065] and each element can be accessed independent of the others.
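Step 2 can be sketched in a few lines of Python. The names `parse_record` and `parse_records` are hypothetical, and the sketch assumes every line has exactly three comma-separated fields with a whole-number count in the last one.

```python
def parse_record(line):
    """Split one 'Name,Sex,Count' line into [name, sex, count]."""
    # strip() tolerates stray spaces around the commas
    name, sex, count = (part.strip() for part in line.split(","))
    return [name, sex, int(count)]


def parse_records(lines):
    """Parse every line of a year's file."""
    return [parse_record(line) for line in lines]


demo_records = parse_records(["Mary,F,7065", "Anna, F, 2604"])
print(demo_records)
```

Converting the count to an integer at this point is what makes it possible to add and sort popularity values later on.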

Things were progressing well! We had our data for each year, and through JavaScript these sets could be converted to “JSON Objects” to populate an online database.

Converts our data arrays to JSON objects & pushes them to a JSON database. An interesting challenge: with so much data, the names were getting pushed to the server out of order. Solved by my partner—the “await sleep(80)” line tells it to wait 80 milliseconds between each upload, not moving too fast and skipping ahead of itself.
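The project did this step in JavaScript; the sketch below shows the same idea in Python so it lines up with the data-prep code. Everything here is an assumption for illustration: `to_json_objects`, `upload_in_order`, and the `send` callback are hypothetical names, and `send` stands in for whatever actually posts a record to the server. The "id" field is left out because the server generates it.

```python
import json
import time


def to_json_objects(records):
    """Turn [name, sex, count] rows into dictionaries ready for JSON."""
    return [{"name": n, "sex": s, "popularity": c} for n, s, c in records]


def upload_in_order(objects, send, delay_seconds=0.08):
    """Send one object at a time, pausing between uploads (80 ms by default)."""
    for obj in objects:
        send(json.dumps(obj))      # hypothetical uploader, e.g. an HTTP POST
        time.sleep(delay_seconds)  # mirrors the post's "await sleep(80)"


# Demo: "upload" into a plain list instead of a real server.
sent = []
rows = [["Mary", "F", 7065], ["Anna", "F", 2604], ["Emma", "F", 2003]]
upload_in_order(to_json_objects(rows), sent.append)
print(sent)
```

In this blocking Python version the loop already guarantees the order, so the pause is only there to mirror the fix described above. In asynchronous JavaScript, waiting for each request to finish before starting the next is another common way to keep uploads in sequence.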

If you’re sitting there wondering when things were going to get exciting… it’s now.

Some math: You’ll remember I said earlier that each year’s data was anywhere from 2,000-30,000+ lines of text, one line per name. To be exact, for years 1880-2020, we had 2,089,828 lines of text, each one an item of data. And it gets worse:

In our text files, a single name, its gender and its popularity together took up one line.

But when uploaded as a JSON object, our one-line-of-text name is now 6 lines of code!

{ // Open bracket { creates new object, 1

"name": "value", // Assigns a Name value to object, 2

"sex": "value", // Assigns a Gender value to object, 3

"popularity": value, // Assigns a Popularity value to object, 4

"id": value // ID is automatically generated by server, 5

}, // Closing brace ends the object; the comma sets up the next one, 6

So our 2,089,828 lines of text just became 12,538,968 lines of code (2,089,828 × 6). Throw in another 140 lines, one for each year, and this… was not manageable. Any data retrieval we tried, any search result filters we built, would have to iterate through over 12 million lines of code and over 2 million entries. We had to shrink our data set.

We had a few solutions. Our first cut our data load by nearly 9/10ths: combine the databases for every 10 years, and let users pick their favorite decade instead of a range of years.

These functions take two given sets of name data and, for each name, add together its popularity over the two sets. In the example you’ll see that in the years 1880 and 1881, there were 13,984 Marys to second place’s 5,302 Annas.
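One way to sketch the combining step in Python is with a `Counter`. The function name `combine_years` is hypothetical, and keying on the (name, sex) pair is an assumption, made so that a name appearing on both the girl and boy lists stays as two separate entries. The per-year counts other than Mary's 7065 are illustrative values chosen so the totals match the ones quoted above.

```python
from collections import Counter


def combine_years(*years):
    """Add up each name's count across any number of yearly record lists."""
    totals = Counter()
    for records in years:
        for name, sex, count in records:
            totals[(name, sex)] += count
    combined = [[name, sex, count] for (name, sex), count in totals.items()]
    # Re-sort so the most popular names come first again
    combined.sort(key=lambda row: row[2], reverse=True)
    return combined


y1880 = [["Mary", "F", 7065], ["Anna", "F", 2604]]
y1881 = [["Mary", "F", 6919], ["Anna", "F", 2698]]
combined_demo = combine_years(y1880, y1881)
print(combined_demo)
```

A name that shows up in only one of the years simply keeps its single-year count, which is why the combined list ends up a little longer than half the two inputs put together.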

When two years are combined, their data isn’t quite cut in half, but it’s close! Other than a few names that appear on one list and not the other, most names combine with their counterpart, giving us around half the items of data.

But even after combining entire decades, we were still facing if not a mountain of data, quite a large hill. Our next solution was simple: instead of using the entire name database from a given decade, reduce each decade to its top x names, which not only put a hard ceiling on data items but gave us more distinct and flavorful names of the era.

There was some work to be done, but there was a solution in sight, which was a relief after the panic that comes with staring down 12 million lines of code.

The solution was a bit complicated by the way the data was structured: every single female name by popularity, then every single male name by popularity. We could not simply take the first x entries of a given list; we first had to isolate and split the names by gender.

These functions together take a given set of name data (in the example: 1976), and return the top x (example: top 10) female and male names of that set.
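A hedged sketch of the split-then-trim idea in Python. `split_by_sex` and `top_names` are hypothetical names, and the counts in the demo are made-up placeholders, not real figures from any year's file.

```python
from operator import itemgetter


def split_by_sex(records):
    """Separate [name, sex, count] rows into a girls list and a boys list."""
    girls = [row for row in records if row[1] == "F"]
    boys = [row for row in records if row[1] == "M"]
    return girls, boys


def top_names(records, x):
    """Return the x most popular girl names and the x most popular boy names."""
    girls, boys = split_by_sex(records)
    by_count = itemgetter(2)
    return (
        sorted(girls, key=by_count, reverse=True)[:x],
        sorted(boys, key=by_count, reverse=True)[:x],
    )


demo = [
    ["Jennifer", "F", 300], ["Melissa", "F", 250], ["Amy", "F", 200],
    ["Michael", "M", 400], ["Jason", "M", 350], ["David", "M", 100],
]
top_girls, top_boys = top_names(demo, 2)
print(top_girls, top_boys)
```

The raw yearly files are already ordered by popularity within each sex, so the sort is redundant there; it is kept because combined data may no longer be in order.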

From here we could finally produce our database. We chose the top 100 names of each decade as our cutoff point for two reasons: first, it’s an aesthetically pleasing number; second, after starting at 50 names, we felt like each time we went a few spots further down the list, the names got more fun and more distinct to their era.

This function takes in 10 years of data, adds them together, re-sorts them by popularity, then trims them down to the specified number. And there you have it: The top 100 female and male names of the 1900s.
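Putting the pieces together, a decade-level function could look something like the following Python sketch. The name `top_of_decade`, the dictionary it returns, and the tiny two-"year" demo with invented counts are all assumptions for illustration.

```python
from collections import Counter


def top_of_decade(years, limit=100):
    """Sum counts across the given years, then keep the top `limit` names per sex."""
    totals = Counter()
    for records in years:
        for name, sex, count in records:
            totals[(name, sex)] += count

    result = {"F": [], "M": []}
    # most_common() walks the totals from highest count to lowest
    for (name, sex), count in totals.most_common():
        if sex in result and len(result[sex]) < limit:
            result[sex].append([name, sex, count])
    return result


years = [
    [["Mary", "F", 10], ["Ruth", "F", 4], ["Anna", "F", 3],
     ["John", "M", 9], ["Carl", "M", 2]],
    [["Mary", "F", 8], ["Anna", "F", 6], ["Ruth", "F", 1],
     ["John", "M", 7], ["Earl", "M", 5]],
]
decade_demo = top_of_decade(years, limit=2)
print(decade_demo)
```

Trimming after the sums are complete matters: a name that is middling in any single year can still crack the decade's top list once all ten years are added up.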

We pushed these new Top 100 of Each Decade datasets to the JSON database, and ended up with a modest 16,824 lines of code. Quite manageable for the modern computer. With the data finally finished, we could work on… actually building the website:

Here we have the website itself. Some copy up top that was supposed to be temporary and somehow survived, and at the bottom, the real meat of the website. The search function.

We built filters for users to select between Boy and Girl names, to select a favorite decade, a starting letter preference (or no preference!), and a length preference.

Where the magic happens. Fetches the data for the user’s chosen decade, iterates through the array of data, grabs each name that matches the user’s form inputs and adds it to a new array. Then picks a random name from the list and displays it on the screen, along with statistics.
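The site does this in JavaScript after fetching the chosen decade's data. The Python sketch below shows only the filter-and-pick logic, on records that are assumed to be already loaded. `matching_names` and `pick_name` are hypothetical names, treating the length preference as an exact letter count is a guess at how the form works, and the demo counts are placeholders.

```python
import random


def matching_names(records, sex, first_letter=None, length=None):
    """Keep the rows that match the chosen sex, starting letter, and length."""
    matches = []
    for name, row_sex, count in records:
        if row_sex != sex:
            continue
        if first_letter and not name.lower().startswith(first_letter.lower()):
            continue
        if length is not None and len(name) != length:
            continue
        matches.append([name, row_sex, count])
    return matches


def pick_name(records, sex, first_letter=None, length=None):
    """Return one random matching row, or None when nothing fits."""
    matches = matching_names(records, sex, first_letter, length)
    return random.choice(matches) if matches else None


demo = [
    ["Debra", "F", 500], ["Deborah", "F", 450], ["Linda", "F", 400],
    ["David", "M", 600], ["Gary", "M", 300],
]
print(pick_name(demo, "F", first_letter="D", length=5))
```

Passing `None` for the letter or the length corresponds to the "no preference" option, and returning `None` gives the page a way to tell the user that no name matched.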

And… it works!

Debra! That’s such 1950s vibes! Very Mad Men. The website works!

Hit the button again and a new name appears. Keep hitting the button! It’s fun!

Grayson? Yikes. But! The site changes colors like a gender-reveal party. Cool.

There’s a few more features I’d like to add that shouldn’t be too many hours of work but weren’t quite ready in time for the due date:

Gender neutral name support. The government data may be archaic and regressive, but the name generator doesn’t have to be. Couldn’t quite finish it in time, but it’s almost there:

Separates into boy and girl names, then compares each name on the arrays—when one appears on both, it gets added to our new array of neutral names. Just need to add a test for frequency: for names that are commonly gender neutral and have an appreciable popularity on both lists, not just a “Boy Named Sue” situation.

Filter for recent popularity. Admittedly, something I considered one of the core filters that we just couldn’t get to work in time. One where users can specify if they want a name that’s currently Common, Rare or Very Rare, and names are compared against a database of only the last 5 years.

Actually getting it online so you can play around with it. I’m sorry that I saved that reveal for so late in the post. It’s not online. You can’t use it. It’s technically online on GitHub Pages but it won’t run because they don’t provide server support. I’m working on it, people!
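Going back to the gender-neutral idea from the list above, here is one possible Python sketch of the missing frequency test. The function name `neutral_names` and the 20% `min_share` threshold are invented for illustration, the demo counts are placeholders, and the sketch assumes each name appears at most once per sex in the input.

```python
def neutral_names(records, min_share=0.2):
    """Find names on both lists where the rarer sex still holds a fair share."""
    girls = {name: count for name, sex, count in records if sex == "F"}
    boys = {name: count for name, sex, count in records if sex == "M"}

    neutral = []
    for name in sorted(girls.keys() & boys.keys()):  # names on both lists
        f_count, m_count = girls[name], boys[name]
        # Share of the name's total usage held by the less common sex
        share = min(f_count, m_count) / (f_count + m_count)
        if share >= min_share:
            neutral.append([name, f_count, m_count])
    return neutral


demo = [
    ["Jordan", "F", 40], ["Sue", "F", 990], ["Mary", "F", 800],
    ["Jordan", "M", 60], ["Sue", "M", 10], ["John", "M", 700],
]
print(neutral_names(demo))
```

With these numbers Jordan passes (the smaller side is 40% of the total) while Sue is filtered out at 1%, which is the “Boy Named Sue” case the threshold is meant to catch.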

In the end, I walked away quite happy with the page. It’s a charming little website that’s fun to play around on; it has some genuine utility, and even if it doesn’t seem like some technological wonder at first glance, the process of getting it organized, up, and running made this project more than meets the eye.

The Wes Wing
