Data-Metrics Circularity in PE


Many businesses in ‘traditional’ industries are now using data science to rethink how they operate. That’s because data-science tools can empower people at all levels of an organization to make decisions that more closely align with empirical facts, rather than basing choices primarily on past practice, intuition, or best guesses. Put differently: data science lets businesses behave scientifically, and improve performance by doing so.

Obviously, the value of businesses behaving scientifically was understood well before the rise of data science, as evidenced by Peter Drucker’s long-famous observation that businesses can’t manage what they can’t measure. What data science has done, however, is to highlight the importance of a fundamental circularity between data and metrics.

  • Metrics are constructed from data, and so their quality and richness are restricted by the quality and richness of the datasets from which they are calculated.
  • Data is rarely free – in the sense that there is some cost (however negligible) to acquire, process, and maintain it. Consequently, organizations can responsibly manage only a limited amount of data, and the data they choose to manage is primarily driven by what metrics they prioritize.

Metrics and data are therefore mutually influencing: better data leads to improved metrics, and sharper metrics motivate sourcing new, high-quality datasets; but scarce, unreliable data yields impoverished or misguided metrics that can trap organizations into making poor decisions from weak data.

Historically, this circularity has been harmful to the private equity (PE) industry. Limited availability of data has prompted investors to fixate on under-informative metrics, and therefore has impeded their ability to understand and make responsible decisions on investing in PE funds. For the most part, this circularity persists because of one-and-a-half falsehoods about data and metrics in PE. These falsehoods are as follows:

  1. The PE industry is necessarily opaque and secretive, which means that investors must simply accept the fact that data on PE funds and assets will be scarce and of low quality.
  2. Large, fine-grained datasets are not really needed for PE investing anyway, because a handful of popular heuristics is sufficient for making decisions about PE funds.

The first claim is false. And we say the second claim is only half-false because, until recently, not enough data has been available to rigorously test the legitimacy of many of these heuristics.

For example, there is widespread belief in persistence of top-quartile performance – i.e., a PE fund manager that has delivered returns in the top 25% of its peer group in previous funds is highly likely to do so again in future funds. As a prima facie assumption, this seems intuitively reasonable: better past performance by a PE manager should bring it better resources (such as helping it to attract talented employees, robust deal-flow, and favorable terms in negotiations), which would make strong future performance more likely. But there is a reasonable counterargument to this “winners keep winning” view: PE managers who have had success in past funds can, for the most part, subsequently raise bigger funds (and they are incentivized to do so). Yet bigger is not necessarily better. With more money, managers must translate their previous success rate to a larger number of deals – or to larger deal sizes – in order to repeat their earlier performance.

Observably, the persistence of top-quartile performance is a mixed truth. Some managers consistently deliver top-quartile returns across many funds, while others cannot. But the prevailing suite of metrics that is currently used to analyze PE performance is ill equipped to make such distinctions in advance – largely because it is unknown what metrics are most meaningful for doing so, which is itself a result of the paucity of relevant data on fund behavior.

Fixing PE – in terms of making fee structures and partnership agreements more aligned with the needs of institutional investors (e.g., pension funds and endowments) – will require repairing this circularity between data and metrics. Driving these fixes is our major motivation in launching the X-SPEEDS (ExplorationS in Private Equity with Data Science) program. We aim to discover, and empower others to discover, robust metrics that lead to more informed decisions for investors. But to discover these new metrics, we must start by building world-class datasets.


By “world-class” datasets, we mean datasets that are capable of delivering deeper, more relevant insights than any other available. But what might such a dataset look like for PE analysis? In the following, we discuss some of the properties that a world-class PE dataset should exhibit – properties that we used to guide construction of our own world-class dataset.

Clearly, a world-class dataset for PE analysis should be extensive – it should cover a sufficient proportion of all PE funds worldwide, and contain rich information on each. Unfortunately, most existing datasets are highly partial in their coverage (they tend to specialize in specific geographies or fund types – e.g., venture capital or buyout funds) and consist mostly of simple variables, such as fund vintage, location, and size. To achieve greater extensiveness, we have therefore combined multiple existing datasets and augmented them, so that our dataset not only has more complete coverage of the universe of PE funds worldwide, but also sufficient details about each individual fund it contains. Although our dataset is not exhaustive (it does not cover every PE fund in the world across history), we have tried to ensure that it is representative: by cross-referencing our dataset with lists of funds worldwide, we have attempted to make our dataset contain fund types in proportion to their fraction of the global population.

Further, any world-class PE dataset should obviously capture information on fund performance. But end results alone do not go far enough to deliver truly powerful insights into the asset class, or to help investors craft more appropriate strategies for participating in it. What is additionally needed is data on dynamics – i.e., data not just on the returns a fund eventually achieved, but also on the interim changes in its value and the actions it took en route to achieving them. One might initially think that dynamics data would not be that helpful: after all, private equity is a highly illiquid asset class, so might investors get away with caring only about end results? The answer is that they should not.

Despite its illiquid nature that locks investors into long holding periods (relative to other asset classes), responsible investors still must ‘manage’ how they engage with PE funds in their portfolios. Doing so can entail, for example, balancing risk exposures in the PE segments of their portfolios by making changes to more liquid segments, or strategically managing their overall liquidity in order to more efficiently handle the capital calls and distributions generated by PE funds in which they invest. The latter is of utmost importance. When an investor commits capital to a PE fund, it is not received in full by the fund manager upfront, but is instead “called” over time (usually over the course of several years). Likewise, many PE funds return capital to investors in increments, rather than all at once at the end of the fund’s lifespan. Properly managing cash levels is therefore crucial for investors with PE holdings, and can affect their overall returns and risk profiles (in addition to impacting their ability to honor their commitments).
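To make the liquidity point concrete, here is a minimal sketch of how an investor's cumulative net cash position in a single fund might be tracked from its capital calls and distributions. The function names and the cash-flow figures are hypothetical and purely illustrative (they are not drawn from our dataset); calls are recorded as negative amounts and distributions as positive ones.

```python
from itertools import accumulate


def cumulative_net_cash(flows):
    """Return (year, running total) pairs for a list of (year, amount) flows.

    Convention assumed here: capital calls are negative amounts,
    distributions are positive amounts.
    """
    ordered = sorted(flows)  # sort by year so the running total is chronological
    running = accumulate(amount for _, amount in ordered)
    return [(year, total) for (year, _), total in zip(ordered, running)]


def peak_funding_need(flows):
    """Largest cumulative shortfall the investor must cover with its own cash."""
    lowest = min((total for _, total in cumulative_net_cash(flows)), default=0)
    return max(0, -lowest)


# Made-up flows for one fund, per 100 units of commitment.
example_flows = [(1, -30), (2, -40), (3, -20), (4, 25), (5, 60), (6, 50)]

if __name__ == "__main__":
    for year, total in cumulative_net_cash(example_flows):
        print(f"year {year}: cumulative net cash {total}")
    print("peak funding need:", peak_funding_need(example_flows))
```

In this toy example the investor is out of pocket by 90 units at the low point in year 3, even though the fund ultimately returns more than was called. It is the timing of the flows, not just the end result, that determines how much liquidity must be kept on hand.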

Historical cash-flow data – i.e., records of when PE funds called and distributed capital to investors – is therefore crucial for a world-class dataset for analyzing PE. And the “when” here is essential: it is necessary to know, at a fine level of granularity, at what point in its lifespan a PE fund made capital calls and distributions. This need stems from the limited size of the PE market. There are only so many funds in the PE universe overall (on the order of hundreds at any point in time, rather than, say, tens of thousands). If one is trying to make focused predictions, then having funds that are sufficiently similar to one another is vital – e.g., in terms of size, strategy, and geographies of focus. But the number of funds in any one cluster of ‘apples-to-apples’ peers is unavoidably small. This is a problem because most methods in data science require a sufficient sample size to yield valid results. One way to circumvent this problem is to expand the ‘time box’ of comparable funds by including funds with relevant characteristics that were started in different time windows. But knowing how large such windows should be is itself a matter that must be solved adaptively.
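One simple way to picture adaptive windowing is to widen a window of vintage years around a target fund until the peer group reaches a minimum size. The sketch below is an illustration under our own assumptions, not a description of any particular toolkit: the `Fund` record, the `peer_window` function, the symmetric window, and the sample funds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fund:
    name: str
    vintage: int   # year the fund started
    strategy: str  # e.g. "buyout" or "venture"


def peer_window(funds, target_vintage, strategy, min_peers, max_half_width=10):
    """Widen a symmetric vintage window until it holds at least min_peers funds.

    Returns (half_width, peers). If no window up to max_half_width is large
    enough, returns (None, []).
    """
    same_strategy = [f for f in funds if f.strategy == strategy]
    for half_width in range(max_half_width + 1):
        peers = [f for f in same_strategy
                 if abs(f.vintage - target_vintage) <= half_width]
        if len(peers) >= min_peers:
            return half_width, peers
    return None, []


# Made-up funds, for illustration only.
sample_funds = [
    Fund("A", 2005, "buyout"),
    Fund("B", 2006, "buyout"),
    Fund("C", 2008, "buyout"),
    Fund("D", 2005, "venture"),
    Fund("E", 2003, "buyout"),
]

if __name__ == "__main__":
    width, peers = peer_window(sample_funds, 2005, "buyout", min_peers=3)
    print(width, [f.name for f in peers])
```

Here the window has to grow to two years on either side of 2005 before three buyout peers are found. A real analysis would also need to weigh the cost of widening the window (less comparable funds) against the benefit (a larger sample), which is exactly the trade-off that has to be resolved adaptively.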

This need for adaptive window sizing highlights a key consideration in applying data science to PE analysis: spreadsheets are ungainly tools for this work because they make adaptive divisions of samples cumbersome. There is a genuine need for those who work with PE data – whether they be investors, researchers, or others – to break up with spreadsheets. In their place, more elegant tools are appropriate. For our research in the X-SPEEDS project, we have benefited heavily from the capabilities of Jupyter notebooks, as well as the power of RCI’s portfolio analysis toolkit. We are firmly convinced that using power tools like these is the future of PE research and analysis.
