An Analysis of Y Combinator Startups

Medium Article

Data source: https://www.snappr.co/ycdb

Skillset: Web scraping, ETL, web development, data transformation

Technology: Python (BeautifulSoup, Pandas, Matplotlib, Seaborn), Ruby on Rails (Web), JavaScript (D3.js)

Data transformation: HTML -> Jupyter/Python -> D3/JavaScript

Things to note about the data from Snappr:

  • I scraped the data from Snappr on 22 October 2017.
  • The Snappr data is incomplete: the Y Combinator website states that it has funded 1464 startups, while this dataset contains 1317.
  • Funding amounts are not 100% accurate.
  • Some companies do not disclose their funding, so Snappr does not display an amount; I label these as 0.
  • Some companies show $0.1M in funding; I suspect this refers to the $120K given by YC.
  • Locations are not accurate. For example, search for Tech in Asia, Xfers, Semantics3 or Saleswhale in the Table 1 search bar: the location shown is not Singapore, although they are Singapore startups.
  • You can inspect the data via console.log.

Table 1: Original data from Snappr

var data = [{FAVICON: 'string', COMPANY: 'String', ... DESCRIPTION: 'STRING'}, object, object, ...];
FAVICON | COMPANY | BATCH | DOMAIN | LOCATION | STATUS | FUNDING (in millions) | CATEGORY | DESCRIPTION

Table 2: Aggregate data based on location

As mentioned above, the startup locations are not accurate. In addition, since Y Combinator is located in Mountain View, the data is biased toward Silicon Valley.

Total startups = exited + live + dead + unknown
Failure rate = dead startups / total startups
Success rate = exited startups / total startups
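
The formulas above can be sketched in JavaScript as follows. This is a minimal sketch and not the code behind the table: the function name aggregateByLocation, the lowercase output fields, and the exact STATUS strings ('Exited', 'Live', 'Dead') are assumptions. Only the LOCATION and STATUS column names come from Table 1.

```javascript
// Hypothetical helper: group Table 1 rows by LOCATION and compute the rates.
// Assumes STATUS is 'Exited', 'Live' or 'Dead'; anything else counts as unknown.
function aggregateByLocation(rows) {
  var groups = {};
  rows.forEach(function (row) {
    var key = row.LOCATION || 'Unknown';
    if (!groups[key]) {
      groups[key] = { location: key, total: 0, exited: 0, live: 0, dead: 0, unknown: 0 };
    }
    var g = groups[key];
    var status = String(row.STATUS || '').toLowerCase();
    if (status === 'exited' || status === 'live' || status === 'dead') {
      g[status] += 1;
    } else {
      g.unknown += 1;
    }
    g.total += 1; // total = exited + live + dead + unknown
  });
  return Object.keys(groups).map(function (key) {
    var g = groups[key];
    g.success = (g.exited / g.total) * 100; // success rate in %
    g.failure = (g.dead / g.total) * 100;   // failure rate in %
    return g;
  });
}
```

Calling aggregateByLocation(data) on the Table 1 array returns one object per location, which can then be sorted by success or failure. Locations with a small total give noisy rates.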

Sorting by the SUCCESS column and going to page 3 (the first 2 pages list locations with a sample size of less than 10), startups located in Silicon Valley (Palo Alto, Mountain View, San Francisco) have the highest success rate.

Sorting by the FAILURE column and going to page 3, startups located in Silicon Valley do not have the highest failure rate.

LOCATION | FUNDING (in M) | AVERAGE FUNDING (in M) | TOTAL | EXITED | LIVE | DEAD | UNKNOWN | SUCCESS (%) | FAILURE (%)

Table 3: Aggregate data based on batch

Sorting by the AVERAGE FUNDING column does not yield concrete insights, as unicorn startups (Airbnb, Dropbox, etc.) skew the averages significantly.

Sorting by the SUCCESS column, taking the top 10 batches and comparing them with S17, it seems that startups take an average of 8 years to exit. Since the exact exit dates are not given, take this number with a huge pinch of salt.

// Top 10 batches by success rate (S = summer, W = winter)
var success_batch = ['S05', 'W10', 'W06', 'S10', 'S11', 'W09', 'W07', 'W12', 'W11', 'S08'];
// Half-years between each batch and S17: [S17-S05, S17-W10, ... , S17-S08]
var success_batch_halfYears = [24, 13, 21, 14, 12, 17, 19, 9, 11, 18];
var average_batch_to_exit = success_batch_halfYears.reduce((a, b) => a + b, 0) / success_batch_halfYears.length; // 15.8
var average_year_to_exit = average_batch_to_exit / 2; // 7.9

BATCH | FUNDING (in M) | AVERAGE FUNDING (in M) | TOTAL | EXITED | LIVE | DEAD | UNKNOWN | SUCCESS (%) | FAILURE (%)

Table 4: Aggregate data based on status

We are going to focus only on exited and dead startups.

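var exited_startups = 168, dead_startups = 87;
var exited_average_funding = 6.73; // in $M
var est_return_rates = 4, yc_shares = 0.07, yc_money = 0.12; // yc_money in $M
var yc_net = exited_startups * exited_average_funding * est_return_rates * yc_shares - dead_startups * yc_money;
// yc_net = 168 * 6.73 * 4 * 0.07 - 87 * 0.12
// yc_net = 306.14 (in $M)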

Based on this estimate, YC has gained at least $306 million. The failure rate for YC startups is 6.6% (87 dead out of 1317 startups), in sharp contrast to the high failure rate usually cited for tech startups.

STATUS | FUNDING (in M) | AVERAGE FUNDING (in M) | TOTAL | % of ALL STARTUPS (1317)

Table 5: Keywords extracted from startup categories

Each startup has a category stored as a string. I parse the keywords in each category by splitting on commas.

Airbnb startup category = "Travel and Tourism, Commerce and Shopping"
Airbnb has 2 keywords = Travel and Tourism, Commerce and Shopping
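
The comma-splitting step above can be sketched in JavaScript as follows. This is a minimal sketch under assumptions and not the code behind the table: the function name countKeywords and the output field names are hypothetical. Only the CATEGORY column name comes from Table 1.

```javascript
// Hypothetical helper: count how often each keyword appears across all rows.
// Assumes each row has a CATEGORY string with comma-separated keywords.
function countKeywords(rows) {
  var counts = Object.create(null); // no inherited keys such as "constructor"
  rows.forEach(function (row) {
    String(row.CATEGORY || '')
      .split(',')
      .map(function (keyword) { return keyword.trim(); })
      .filter(function (keyword) { return keyword.length > 0; })
      .forEach(function (keyword) {
        counts[keyword] = (counts[keyword] || 0) + 1;
      });
  });
  // Convert to an array sorted by descending frequency.
  return Object.keys(counts)
    .map(function (keyword) {
      return { keyword: keyword, frequency: counts[keyword] };
    })
    .sort(function (a, b) { return b.frequency - a.frequency; });
}
```

Trimming each piece matters here because the categories are separated by a comma followed by a space, so a plain split would leave a leading space on every keyword after the first.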

Unsurprisingly, based on the keywords, Y Combinator startups are focused on technology.

You can use the Table 1 search bar to look up startups by keyword.

KEYWORD | FREQUENCY