This project is not peer-reviewed. There are a few things to take into account when you read this article:
- As of October 2018, Twitter has 326 million monthly active users, which is a very small fraction of the world’s population. Twitter users in each city might not be representative of that city.
- The amount of data I used is also small and subject to sampling bias.
- Some of the methods used are keyword-based, which makes them sensitive to spelling variations.
- This project undoubtedly reflects my biases. I’m also not from the US and might be oblivious to many aspects of political correctness. I’d appreciate any feedback you might have!
I’d like to thank Fernando Pereira for pointing out the shortcomings of this project.
Have you ever been to a party in which someone mentions that they just came back from another city and someone else feels the urge to compare said city with the current city they are in? Something along the lines of: “Oh I love London. It’s so much more diverse than the Bay Area.” Or some people start a heated argument about which city is the best. If you’re worldly and don’t have your own list of five favorite cities ready to throw at random strangers, are you even worldly?
I’ve always wondered about the validity of those remarks. Generalizing from personal experiences is a slippery slope and prone to stereotyping. So I thought: “Why not use data to compare cities?” I’ve also had the suspicion that the Bay Area is a peculiar place. I’m hoping that data can show me exactly how peculiar.
This post consists of the following sections:
- How people in different cities describe themselves
- What people talk about
- Unique emojis in each city
- The most unique city
The only source of public, semi-geo-tagged data I could think of was Twitter. I started with a seed set of users in various locations extracted from the Kaggle dataset Sentiment140 and used the Twitter API to find these users’ followers. I picked the following 13 English-speaking metropolitan areas, mostly because of the availability of data. I got 96K users from these 13 areas.
- US Cities (9): Atlanta, Austin, Bay Area, Boston, Chicago, Washington DC, Los Angeles, New York City, Seattle
- Australian Cities (2): Melbourne, Sydney
- Canadian Cities (1): Toronto
- UK Cities (1): London
I also found 223K users that aren’t in these areas which I collectively put in “Other”. These users come from a wide range of locations all over the world, both rural and urban areas.
For each user in these 13 areas, I collected the following information:
- Bio (as of April 2019)
- Tweets with timestamp, retweet count, and favorite count
For users in “Other”, I only collected their bios and not their tweets (because it would take forever).
For data processing, I tokenized the text with spaCy, removed stopwords, and replaced urls with
I only show a selection of the visualizations in this post. Code, anonymized data, and more visualizations for this blog post can be found on GitHub here. If you clone the Jupyter notebook keywords from the GitHub repo, you can enter a keyword to rank cities by the frequency of that keyword and compare how popular a set of keywords is within a city.
How people in different cities describe themselves
My first attempt was to try to create a word cloud of each city based on each token’s frequency. I used Andreas Mueller’s wonderful word_cloud library.
There are clear differences among cities. For example, while music is a big part of any city, it’s a comparatively small part of the bios from the Bay Area and Boston, and completely absent in Washington DC. I’ve always thought it strange that people use “music” to identify themselves. Who doesn’t listen to music? How Bay Area of me.
A sad thing to note is that “love” is prominent in all cities except Washington DC. Some of the common words in DC bios include “opinion”, “view”, and “tweet”, which come from disclaimers like: “Tweets/views/opinions are my own.”
However, the differences aren’t that easy to spot because of many common words. I thought it would be more revealing to plot only the differences between two cities. For each pair of cities, I subtracted the words in common from both vocabularies, and visualized the words that are left in a pair of word clouds.
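One way to read the subtraction step is sketched below. This is not the code from the repo: the `vocabulary_difference` helper, the `top_n` cutoff, and the toy token counts are all assumptions made for illustration.

```python
from collections import Counter


def vocabulary_difference(counts_a, counts_b, top_n=200):
    """Return the tokens that are in one city's top-N vocabulary but not the other's.

    top_n is a hypothetical cutoff; the post does not say how large each
    vocabulary was before the subtraction.
    """
    top_a = dict(counts_a.most_common(top_n))
    top_b = dict(counts_b.most_common(top_n))
    shared = top_a.keys() & top_b.keys()
    only_a = {word: count for word, count in top_a.items() if word not in shared}
    only_b = {word: count for word, count in top_b.items() if word not in shared}
    return only_a, only_b


# Made-up, already-tokenized bios. These are not numbers from the dataset.
bay_area = Counter("product founder engineer love music product startup".split())
los_angeles = Counter("producer actor writer love music producer".split())

only_bay, only_la = vocabulary_difference(bay_area, los_angeles)
print(only_bay)
print(only_la)
```

The two returned dictionaries map tokens to counts, so each can be handed to a word cloud library that accepts frequencies.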
The first thing I notice is that the differences are startling and very much reinforce the stereotypes we have about each city. The Bay Area leans heavily towards tech and entrepreneurship, Washington DC towards politics, and Seattle towards software engineering (dominated by Amazon and Microsoft, and not so much by startups). LA is all about that ‘Netflix and chill’ life. The homogeneity of the Bay Area and DC is rather depressing to look at. I also find it funny that the biggest word in the Bay Area is ‘product’ and the biggest word in LA is ‘producer.’ This reminds me of the saying: “If you’re not paying for the product, you’re the product.” We’re all just products in the Bay.
Given that Google, Snapchat, and Tinder all have offices in LA, there’s a surprising lack of tech in the city. LA loses to almost all cities, even Atlanta, when it comes to tech.
People have told me that Austin feels like a smaller version of the Bay Area because of its vibrant startup scene. Looking at the chart, I’d say Austin is a better version of the Bay. In Austin, people still care about family, music, and art. In the Bay, it’s all tech.
Comparing each city to the rest of the world reveals a dimension that is missing in almost all cities: religion. For example, if you compare the Bay Area to the rest of the world, some of the keywords that are most frequently used in “Other” but not in the Bay Area are: “god”, “jesus”, “christ”, “christian”. A keyword that is uniquely missing in the Bay Area bios is: “old”. Ageism is real. Nobody wants to admit to being old here. The lack of keywords like “mother”, “wife”, “kids”, and “married” in the Bay is what I take as a given.
From this dataset, we can figure out where CEOs, founders, VCs, and coffee drinkers live. You’re right, they are all in the Bay Area.
The Bay Area isn’t doing that badly for many non-tech careers. We’re behind entertainment hubs like NYC, London, and LA, but we’re doing much better than cities like Boston and Austin. This area also has the highest number of students.
What people talk about
For the tweets, I subsampled 2M tweets from each city in order to make it fair to compare across cities (also faster for creating visualizations). Since I don’t have tweets from users outside these 13 metropolitan areas, for the “Other” category, I combined tweets from all cities.
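Equal-size subsampling can be sketched as follows. The `subsample_per_city` helper, the fixed seed, and the tiny toy lists are assumptions for illustration; the post uses 2M tweets per city.

```python
import random


def subsample_per_city(tweets_by_city, k, seed=0):
    """Draw at most k tweets per city, uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for city, tweets in tweets_by_city.items():
        if len(tweets) <= k:
            sampled[city] = list(tweets)  # fewer than k available: keep them all
        else:
            sampled[city] = rng.sample(tweets, k)
    return sampled


# Made-up tweets standing in for the real corpus.
tweets_by_city = {
    "Bay Area": [f"bay tweet {i}" for i in range(10)],
    "Chicago": [f"chicago tweet {i}" for i in range(4)],
}
sampled = subsample_per_city(tweets_by_city, k=5)
print({city: len(tweets) for city, tweets in sampled.items()})
```

Sampling the same number of tweets from each city keeps raw counts comparable, so a city is not ranked higher just because more of its tweets were collected.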
Again, when I created word clouds for cities separately, the differences were harder to notice as most clouds were dominated by common keywords: ‘love’, ‘work’, ‘GOT’ (ahhh mother dragon why), ‘friend’.
When I visualize only the differences between two cities, the differences are usually dominated by the local sports teams, the local newspapers, and some minor spelling variations across countries (‘center’ vs ‘centre’, ‘mom’ vs ‘mum’).
Atlanta stands out for mixtapes, horoscopes, repurposed NSFW self-identifying words, and a surprising lack of talk on Trump.
From the graphs, the biggest local sports teams in the Bay might as well be API and Android.
Coming from the Bay where public transportation is a pain and often serves as a conversation starter, I used to think that complaining about the local transportation is an indispensable part of urban life. I was wrong. There are cities that take an interest in their local public transportation such as Chicago (CTA, METRA), New York (MTA), and Toronto (TTC), but most cities seem oblivious to the struggle of people in the Bay. There are two possible scenarios:
- The public transportation in that city is so great there’s not much to complain about.
- The public transportation in that city is so bad people just don’t use it.
I ranked the cities based on the frequency of the keyword “transportation” in their tweets and saw that the cities that are most upset with their transportation include the Bay Area, Toronto, Austin (the traffic), and Chicago. The high frequency of the keyword “transportation” in DC is likely due to a lot of talk on transportation policy, but a friend who used to live there told me there’s a non-trivial amount of complaining. The reason that non-American cities like Melbourne, Sydney, and London rank low on the list might be because they use different words when talking about transportation, such as “transport” or “transit”.
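A keyword ranking like this one can be sketched as below. The `rank_cities_by_keyword` helper and the toy counts are assumptions, and normalizing by each city's total token count is my reading of "frequency", not something the post spells out.

```python
from collections import Counter


def rank_cities_by_keyword(city_counts, keyword):
    """Rank cities by the relative frequency of one keyword, highest first."""
    scores = {}
    for city, counts in city_counts.items():
        total = sum(counts.values())
        # Occurrences of the keyword per token, so larger corpora are not favored.
        scores[city] = counts[keyword] / total if total else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up token counts. These are not numbers from the dataset.
city_counts = {
    "Bay Area": Counter({"transportation": 4, "bart": 6, "love": 10}),
    "Chicago": Counter({"transportation": 3, "cta": 7, "love": 20}),
    "London": Counter({"transport": 5, "tube": 5, "love": 10}),
}

ranking = rank_cities_by_keyword(city_counts, "transportation")
for city, score in ranking:
    print(f"{city}: {score:.3f}")
```

In the toy data London scores zero even though it talks about "transport", which is the spelling sensitivity described above. Summing the counts of several variants before ranking would be one way to soften it.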
A look at DC’s word clouds compared to the rest of the world makes me deeply concerned for the wellbeing of DC residents and their feline population. DC and Melbourne somehow are the only two cities that talk more about cats than dogs.
As expected, the Bay Area beats everyone else by a long shot for tech talk. Unexpectedly, Atlanta ranks second for AI. Could it be that “AI” means something else there?
People of the Bay Area are concerned about social issues too. We’re more environmentally conscious than all other cities except Washington DC. By comparison, Atlanta is the least susceptible to environmental discussions. We talk about the ‘homeless’ even more than Washington DC, a city filled with people whose jobs are to talk about policies on homelessness. The housing crisis in the Bay is worrisome. We should learn from Melbourne and Sydney, two cities that don’t have much to complain about regarding homelessness. It’s expected that the Bay dominates the talk on privacy – we created the problem in the first place.
We’re also undisputed champions in talking about weed, burrito, crossfit, and boba, beating even LA. The rest of the world really needs to hop on that boba train.
Let’s take a closer look at what Bay Area people care about. Hint: vegan > racism.
Unique emojis in each city
For this task, I used all 180M tweets from all 13 areas. I computed the frequencies of emojis that appear at least 5 times in each city, and picked the emojis whose frequencies are at least twice as high as their frequencies across all cities. Of course, the Bay Area is winning with rock climbing, scooter, and flipping the bird, NYC with no pedestrians, London with music and alcohol, DC with bald eagle and Pinocchio (thanks to the Washington Post Fact Checker’s Pinocchio ratings), and Chicago with peanuts. Emojis in the pictures below are ranked from most to least used.
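The selection rule can be sketched as below, under one reading: "frequency" means an emoji's share of all emoji uses, and "all cities" means the 13 areas pooled together. The `distinctive_emojis` helper and the toy counts are assumptions for illustration.

```python
from collections import Counter


def distinctive_emojis(city_counts, city, min_count=5, ratio=2.0):
    """Emojis used at least min_count times in a city and at least `ratio`
    times as often there as in all cities pooled, ordered most to least used."""
    overall = Counter()
    for counts in city_counts.values():
        overall.update(counts)
    overall_total = sum(overall.values())

    local = city_counts[city]
    local_total = sum(local.values())

    picked = []
    for emoji, count in local.most_common():  # most used first
        if count < min_count:
            continue
        local_freq = count / local_total
        overall_freq = overall[emoji] / overall_total
        if local_freq >= ratio * overall_freq:
            picked.append(emoji)
    return picked


CLIMBING = "\U0001F9D7"        # person climbing
SCOOTER = "\U0001F6F4"         # kick scooter
HEART = "\u2764"               # heart
NO_PEDESTRIANS = "\U0001F6B7"  # no pedestrians
PEANUTS = "\U0001F95C"         # peanuts

# Made-up emoji counts. These are not numbers from the dataset.
city_counts = {
    "Bay Area": Counter({HEART: 87, CLIMBING: 10, SCOOTER: 3}),
    "NYC": Counter({HEART: 94, NO_PEDESTRIANS: 6}),
    "Chicago": Counter({HEART: 95, PEANUTS: 5}),
}

for city in city_counts:
    print(city, distinctive_emojis(city_counts, city))
```

In this toy data the heart is common everywhere and never qualifies, while the scooter is filtered out by the minimum count.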
The most unique city
When I told Jessie about the project, she said: “Okay that’s great but what is the most unique city?” Since each city has different things to offer, I thought it would be more telling to measure a city’s uniqueness by what is missing in that city. For each city, I picked the words that appear at least twice as frequently in the tweets of all other cities. Then I ranked all cities by the sum of frequencies of those words. Sadly, the Bay Area wins again. While we’re leading in talks on tech, entrepreneurship, the future, and social issues, we seem to be lacking in all other aspects. Melbourne, Sydney, and London rank high on the list but it could simply be because they have different spellings compared to other cities in this dataset, which are mostly American.
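Here is a minimal sketch of that uniqueness score. The post does not say which frequencies are summed, so this version assumes it is the frequency of each missing word in the other cities' pooled tweets; the `missing_word_score` and `rank_by_uniqueness` helpers and the toy counts are hypothetical.

```python
from collections import Counter


def missing_word_score(city_counts, city, ratio=2.0):
    """Sum the pooled frequencies of words that are at least `ratio` times
    as frequent in all other cities combined as in `city`."""
    local = city_counts[city]
    local_total = sum(local.values())

    rest = Counter()
    for other, counts in city_counts.items():
        if other != city:
            rest.update(counts)
    rest_total = sum(rest.values())

    if local_total == 0 or rest_total == 0:
        return 0.0

    score = 0.0
    for word, count in rest.items():
        rest_freq = count / rest_total
        local_freq = local[word] / local_total  # 0 if the city never uses the word
        if rest_freq >= ratio * local_freq:
            score += rest_freq
    return score


def rank_by_uniqueness(city_counts):
    """Rank cities from most to least 'missing' vocabulary."""
    scores = {city: missing_word_score(city_counts, city) for city in city_counts}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up token counts. These are not numbers from the dataset.
city_counts = {
    "Bay Area": Counter({"tech": 8, "love": 2}),
    "Austin": Counter({"tech": 3, "love": 4, "music": 3}),
    "Boston": Counter({"tech": 2, "love": 5, "music": 3}),
}

ranking = rank_by_uniqueness(city_counts)
for city, score in ranking:
    print(f"{city}: {score:.2f}")
```

A higher score means more of what the other cities talk about is underrepresented in that city, which is the sense of "unique" used here.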
This project took me an ungodly amount of time, but also gave me the ungodly pleasure of seeing my suspicions confirmed (I have to admit that I might have chosen the kind of analyses that helped confirm my suspicions). Yes, the homogeneity of the Bay Area is appalling. Yes, my social life is dominated by techies who smoke pot, obsess over burritos, and introduce themselves as “previously @google @facebook @stripe” but at least my home is weirder than your home.
This project also made me sad. When I was still a kid living in a small village, I wanted to come to a big city because I thought it would be a melting pot of diverse ideas. While some cities are still wonderfully colorful, other cities like the Bay Area and Washington DC seem to be moving towards specialization, hosting only one type of people and endorsing only one kind of talk. I’m also concerned by the prevalence of brand names in our bios: we identify ourselves by the companies we work for and tie our self-worth to the brands we’re associated with.
There are so many more patterns I found from looking at the data, and there are so many more that I failed to recognize. I also didn’t go into more detail about cities other than the Bay Area because I’m biased. For those interested, the rest of the visualizations about all cities can be found here. You can also play with the data here. If you find something interesting or just want to discuss the project, feel free to shoot me an email or find me on Twitter @chipro.
Acknowledgements: I’d like to thank Miles Brundage, Larissa Schiavo, Matthew Conley, Nguyen Pham, Jason Li, and Jessie Duan for helping me with this post. I’m lucky to have so many wonderful people in my life.