8. The myth of the engagement rate

We are going to construct a benchmark for the social performance of a set of profiles. Assume a handful of companies produce social media content, and each one receives a certain number of interactions, for example a number of ‘likes’.

Profile   # of Likes
A         1
B         10
C         100
D         1000
E         10000

We’d like to create an analysis which compares these levels of social engagement. If we use the raw number of interactions, we would say that firm E is the most engaging. That conclusion is dubious without considering any other effects. For example, how much exposure does each of these profiles have? Are some very famous and others obscure? We would of course expect the profile of a well-connected business, public figure or organization to be more popular than an unknown counterpart. One way to control for this is to also consider the size of the audience. Imagine the audience sizes were as follows:

Profile   # of Likes   # of Audience
A         1            10
B         10           100
C         100          1000
D         1000         10000
E         10000        100000

Now it makes sense that profile A has much less interaction, because it has much less potential exposure. Firm A has a far smaller reach, and so we expect its limited exposure to limit its engagement. Firm E has an extremely large potential exposure, and therefore more opportunity for its content to be seen and interacted with. A reasonable next step is to account for this difference in audience size by scaling the engagement to it. To do so, we might divide the number of likes by the audience size to estimate a rate of engagement: Engagement Rate = Likes / Audience.

Converting this into a percentage, we get:

Profile   # of Likes   # of Audience   Engagement Rate
A         1            10              10%
B         10           100             10%
C         100          1000            10%
D         1000         10000           10%
E         10000        100000          10%
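The calculation itself is trivial; as a minimal sketch, it could be scripted as follows (the figures are the illustrative ones from the table above, not real data):

```python
# Likes and audience sizes for the five hypothetical profiles from the table above.
profiles = {
    "A": {"likes": 1, "audience": 10},
    "B": {"likes": 10, "audience": 100},
    "C": {"likes": 100, "audience": 1_000},
    "D": {"likes": 1_000, "audience": 10_000},
    "E": {"likes": 10_000, "audience": 100_000},
}

for name, p in profiles.items():
    # Engagement rate = likes / audience, expressed as a percentage.
    rate = p["likes"] / p["audience"] * 100
    print(f"{name}: {rate:.0f}%")   # every profile prints 10%
```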

The engagement rate is the fundamental statistic of social media analysis. The analyst would return, in each case, with an engagement rate of 10%; that is, each of these profiles is performing equally relative to its peers. Rather than simply comparing point estimates, we might like to create a model to better understand the underlying phenomenon and build a benchmark. To do so we might estimate the relationship between audience and engagement via least squares. In the simplest case we can use a single-variable regression to measure the effect of audience size on the level of engagement, creating a scatterplot of the audience (independent variable) against the engagements (dependent variable). Plotting our small group of profiles would look like:

We would find a perfectly correlated regression. What is the interpretation of this model? It says that for each one-unit increase in audience size we can expect an increase of 0.1 engagements. While 0.1 of an engagement might seem like a strange thought, it simply means that ten additional audience members produce one additional engagement.
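This slope can be recovered directly from the toy data; the following sketch fits an ordinary least squares line (numpy's polynomial fit is just one convenient choice):

```python
import numpy as np

audience = np.array([10, 100, 1_000, 10_000, 100_000])   # independent variable
likes = np.array([1, 10, 100, 1_000, 10_000])             # dependent variable

# Ordinary least squares fit of a straight line: likes = slope * audience + intercept
slope, intercept = np.polyfit(audience, likes, 1)
print(slope, intercept)   # ~0.1 and ~0.0: each extra follower adds 0.1 likes on average
```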

What might be apparent from the graph is that each profile sits ten times further out than the next, and drags the scale with it: firm E has an audience of 100,000 fans, while the next largest has merely 10,000. What is also apparent is the right skewness of the graph. Though they may seem comparable in list form, visually both axes are scaled to firm E. The behaviour of the second most popular profile, firm D, appears negligible, and the remaining three profiles are squeezed into a small area near the origin. At each step up in scale, the leading profile dominates the visualization, the equation, and therefore the insight. Although there are five profiles, only one scale is being displayed. Is it proper to assume that the behaviour of profile E is the same as the rest? Are we truly capturing the phenomenon in the lower range when aggregating it so simply? Our regression results tell us that we have captured the perfect model, and yet something still feels wrong.

To see how well our engagement model compares to the real world, let’s run an experiment. In this case let’s compare the number of engagements for a particular profile against its audience size, except this time for ten thousand of the most active profiles. When we plot the results of our experiment, we obtain the following graph:

Here we obtain a trend; in this case the slope of 0.0096x means that for each one-unit increase in audience size we expect, on average, an increase of 0.0096 engagements. The y-intercept says that if x = 0, that is, a profile with no audience at all, we still expect 14,114 engagements on average. We may also notice that our previous skewness is once again present: the x-axis runs to a scale of 10 million, and yet only a small subset of profiles is anywhere near those values. The majority of profiles are again bunched near the origin, with outliers becoming more frequent as the scale increases. Is this a proper basis for benchmarking a variety of profiles?

Consider that when we fit a linear regression, we are assuming a variety of things about the underlying data. When these assumptions are satisfied, the Ordinary Least Squares estimator, the estimator which minimizes the sum of squared residuals, becomes the Best Linear Unbiased Estimator: not only are the coefficient estimates consistent and unbiased, but so are the estimates of their errors. The Gauss-Markov assumptions include, among other things, that the model is linear in its parameters, that the regressors are not collinear, that the regressors are exogenous (uncorrelated with the error term), and that the errors have zero mean, constant variance, and are uncorrelated with one another. Under these conditions the slope estimated from our sample is a consistent estimator of the slope in the population.

In particular, the Gauss-Markov assumption that the errors have the same variance is violated. Homoskedasticity, from the Greek for “same scatter”, requires that the small variations between profiles carry equal weight even at different scales. This plot, however, appears heteroskedastic, because the potential error or variation between profiles is quite different at each end of the graph. Toward the origin, the profiles are concentrated into a very small area, with slight variation between them. As the profiles increase in scale, the variation becomes much larger. Consider what happens if we overestimated the fans of each profile by 20:

Profile   Original Audience   After Adding 20 Fans   Percent Error
A         10                  30                     200%
B         100                 120                    20%
C         1000                1020                   2%
D         10000               10020                  0.2%
E         100000              100020                 0.02%

A miscalculation of twenty fans has a very different effect on each of the profiles: at the lower end it is a large change, at the higher end it is negligible. To a profile with 100,000 fans, a change of 20 means very little; to the profile with 10 or 100 it means much more. The behaviour is therefore not uniform across scales, and the dense area toward the origin cannot be described in the same manner as the other end. This means our assumption of homoskedasticity is violated. While the OLS estimator may still be consistent and unbiased in the mathematical sense, the estimated variances of the coefficients, and by extension the standard errors and test statistics that allow for any hypothesis testing, become biased. The presence of heteroskedasticity muddles the estimation and the insight obtained from regression analysis. In this scenario the Ordinary Least Squares estimator is no longer the Best Linear Unbiased Estimator for the model.
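If we wanted to test this diagnosis formally rather than by eye, one conventional option is a Breusch-Pagan test on the regression residuals. The sketch below uses statsmodels with synthetic placeholder data, since the point is only the shape of the procedure:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Placeholder data: replace with the audience sizes and like counts being benchmarked.
rng = np.random.default_rng(0)
audience = rng.pareto(2.5, size=1_000) * 100               # heavy-tailed audience sizes
likes = 0.01 * audience + rng.normal(0, audience * 0.01)   # noise that grows with scale

X = sm.add_constant(audience)   # regressors: intercept + audience size
ols = sm.OLS(likes, X).fit()

# Breusch-Pagan regresses the squared residuals on the regressors;
# a small p-value is evidence of heteroskedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print(lm_pvalue)
```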

How does this skew affect the numerical analysis of these properties? The results force us to rethink the purpose of averaging in this scenario. Calculating a regression by hand involves taking the averages of each variable and their squares in order to minimize the distance between the regression line and the data. What are the implications if our averages cannot describe the phenomenon?

Let’s consider the average. The average is supposed to represent a typical or common result of a distribution. That is, as we add more and more data points, the values should more often be found near the average than out in the far tails, and the more points we add, the more robust the average becomes. This is what drives the central limit theorem and is the cornerstone of standard statistical analysis. It means that the average sits somewhere in the middle of the data, that it represents the kind of value we expect to find in the middle of a probability distribution.

But skewed data implies that something else is going on underneath the phenomenon, something that may not be normally distributed. Our diagnosis of heteroskedasticity means that we must investigate the underlying distribution more deeply. Let’s consider how an average becomes skewed in the presence of non-normally distributed data.

If we return to our subset of profiles, we can measure the effect each one has on the average by computing its share of the total. We take the number of likes for each profile, divide it by the sum of the likes of the group (111110) and multiply by 100:

Profile   # of Likes   % Share
A         10           0.009%
B         100          0.09%
C         1000         0.9%
D         10000        9.0%
E         100000       90.0%

This means that profile E has 90% of the likes, while the remaining profiles have only 10% combined; profiles D and E together account for 99% of the data. The distribution of likes is not normal, and the mean is pulled toward the top of the range.

What might be the best way to account for the comparative behaviour of all five profiles without letting more than half of them become negligible?

Let’s consider the average number of likes across these profiles. Summing gives 111110; dividing by five profiles, we find that the average profile has 22222 likes. This average is more than double the value of the second-largest profile. We know that the data is skewed, so we might want to consider another measure for understanding it. The median is the middle value of a sorted series of data; with five values, the middle profile is the third, profile C, with 1000 likes. The average says a typical profile has 22222 likes, but the median says 1000: which is more representative of the data?
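Both figures are quick to verify with the like counts from the table above:

```python
from statistics import mean, median

likes = [10, 100, 1_000, 10_000, 100_000]
print(sum(likes))      # 111110
print(mean(likes))     # 22222
print(median(likes))   # 1000 -- far below the mean, because the data is skewed
```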

Up to this point we have considered the engagement rate only in a statistical fashion; we have taken simple averages in the way a sports analyst might compare two teams. What we have not considered is that the data which drives our analysis, the number of likes for a profile for example, is generated by connectivity. We must also consider the effects of the entire network in our analysis.

Social networks can be described by a graph in the form of nodes and links. Visually, a node is represented by a circle and a link by a line; a connection between two nodes is a line between two circles. What is fundamental in graph analysis is that the connectedness of an entire network has mathematical properties of its own. That is, a large, complex ecosystem can be described by simple rules when we analyse the emergent properties of the system. This interplay between local rules, emergence, and complex topologies is the cornerstone of complexity theory and drives complex network analysis.

Graph topology traces back to Euler’s ‘Seven Bridges of Königsberg’ problem, which abstracted the connectedness of the city’s bridges into exactly these mathematical properties. When analysing the engagement rate of these profiles, alongside our statistical assumptions we are also making assumptions about the underlying topology.

We can take a sample network built under the assumption of purely random connections and create a degree distribution by counting the number of links attached to each node:
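One way to produce such a distribution is to generate a random graph and tally the links of each node; a minimal sketch with the networkx library, where the wiring probability is an arbitrary illustrative choice:

```python
import networkx as nx
from collections import Counter

# Erdős-Rényi random graph: every pair of nodes is linked with equal probability.
G = nx.erdos_renyi_graph(n=20_000, p=0.0002, seed=42)

degree_counts = Counter(d for _, d in G.degree())
for degree in sorted(degree_counts):
    print(degree, degree_counts[degree])   # frequency of each degree; decays rapidly
```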

A normally distributed network decays exponentially: 50% of the nodes have 1 link, 25% have 2 links, 12.5% have 3 links, and so on. It is one side of the bell curve, the normal distribution. The most connected node, the hub, has 17 connections, and occurs about 0.002% of the time; in fact, fewer than 1% of the nodes have more than 7 links. So even after adding twenty thousand nodes at random, the most connected node has only 17 links. This is known as an Erdős-Rényi random graph.

If we say that a profile is representative of the average, we mean that a profile picked at random is most likely to look like the average one. On a graph, if we have a collection of nodes, we are most likely to select a node with an average number of links. It also means the median would be found near the mean, which we know is not the case for our networks.

We know that our current model is not capturing the underlying features due to the presence of heteroskedasticity, which renders our standard errors and hypothesis tests unreliable. Where does this heteroskedasticity come from, and how can we correct for it?

Our linear assumptions are built on the premise that the data is normally distributed and can be expressed in linear parameters. We must set this assumption aside in favour of methods that do not assume the underlying distribution. We can use non-parametric analysis to describe the distribution instead, starting with a histogram of each of our datasets.

First, let’s consider strictly the variable of audience size on its own. Let’s run another experiment: this time we will obtain the number of likes received by four million profiles and plot the frequency with which each value occurs.

We see once again that the frequency vanishes very quickly, but in this network there are a handful of rare, highly connected nodes. Even with the same total number of nodes, the change in connectivity leaves certain nodes far more connected than the rest: in this model we have nodes with over 400 connections, compared to at most 18 in our other model.

If we compare both network topologies on the same graph, we see the difference in the hub formation is even more dramatic:

It is the heterogeneity of this hub formation which drives the heteroskedasticity of our linear model. We might want to transform this frequency. With heteroskedastic data, the goal is often to find some linearizing transformation; this may mean including quadratic or logarithmic terms. If we transform these distributions onto logarithmic axes, we find:
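As a sketch of this comparison, we can grow one random and one preferential-attachment network of equal size with networkx and plot both degree distributions on logarithmic axes (the sizes and parameters are illustrative only):

```python
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter

n = 20_000
er = nx.erdos_renyi_graph(n, p=4 / n, seed=1)    # random wiring, mean degree ~4
ba = nx.barabasi_albert_graph(n, m=2, seed=1)    # preferential attachment

def degree_distribution(G):
    # Fraction of nodes having each observed degree.
    counts = Counter(d for _, d in G.degree())
    degrees = sorted(counts)
    freqs = [counts[k] / G.number_of_nodes() for k in degrees]
    return degrees, freqs

for G, label in [(er, "Erdős-Rényi"), (ba, "Barabási-Albert")]:
    plt.loglog(*degree_distribution(G), marker="o", linestyle="none", label=label)

plt.xlabel("degree k")
plt.ylabel("P(k)")
plt.legend()
plt.show()   # the random graph drops off sharply; the BA graph is roughly a straight line
```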

The Erdős-Rényi random graph (ERRG) decays quickly, while the Barabási-Albert preferential attachment network (BAPA) decays linearly on the log-log plot. Let us now consider the degree analysis of the profile audience distribution.

We have found a topology with very few highly connected nodes and very many weakly connected ones. To account for the variation in scale, we might want to compare the data on a transformed axis, converting each value to a logarithmic scale. When we do so for four million nodes, we obtain the following distribution:

On the y-axis we find the frequency, or the probability that a particular number of likes will be found in the dataset. On the x-axis we find each of the potential values for the number of likes. Close to the origin on the x-axis is a low number of likes, which corresponds to a high probability. As we move out along the x-axis, the frequency decreases. There are also very large outliers as the values increase, which, as we have seen, are what drive the skew.

A distribution that appears as a straight line on a log-log plot is a power law, and describes heterogeneously distributed data. It traces its origins to Pareto’s 80/20 law and is related to the Yule-Simon distribution and Zipf’s law. It is a realization of the idea that ‘the rich get richer’, also known as cumulative advantage or the Matthew Effect.

This rich-get-richer phenomenon can be described by a thought experiment known as the Pólya urn scheme. Imagine an urn of balls, each one a different colour. Start with 5 balls, one of each colour. Pull out a ball at random, note its colour, return it to the urn, and add another ball of the same colour. Colours that have been selected before are more likely to be selected again, because each selection is rewarded with an extra ball. There is a particular class of networks, known as scale-free networks, that is the realization of this effect. Scale-free networks have a highly heterogeneous topology with very few well-connected hubs and many weakly connected nodes. Their degree distribution follows a power law of the form P(k) = Ck^-L, where L is typically found between 2 and 3, and appears as a straight line on a log-log plot.
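The urn scheme is easy to simulate; a minimal sketch with five starting colours and a few thousand draws (all of the numbers are arbitrary):

```python
import random
from collections import Counter

random.seed(7)
urn = ["red", "blue", "green", "yellow", "purple"]   # one ball of each colour

for _ in range(5_000):
    ball = random.choice(urn)   # draw a ball uniformly at random
    urn.append(ball)            # return it, plus one more of the same colour

print(Counter(urn).most_common())   # typically one or two colours come to dominate
```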

So we know we must account for the properties of scale-free networks if we want to analyze them properly. What is the difference at the micro level between simple and complex networks? How does the growth of the network at the micro level determine its macro-level topology? Consider two nodes, A and B:

The question is, which node does an incoming node attach to? In this case, it is a coin flip, as there is no discernible differentiating property of either node.

Assume we attach the new node, C, to node B, and now we want to add node D. If we again select a node to connect to at random, the odds of selecting any given node are 33%:

Consider, however, that node B has 2 connections, while nodes A and C each have one. If we scale each node in accordance with its number of links:

We see that node B is more connected. If we bias our choice toward the more connected node, that is, weight each node by its degree relative to the total, then we can measure the influence of each node: node B, with 2 of the 4 link ends, has a 50% chance of being selected, while nodes A and C have 25% each.

Mathematically this is known as Barabási-Albert preferential attachment, and is a network-based realization of the rich-get-richer phenomenon. We can represent it as P(i) = k_i / Σ k_j, the probability that an incoming node attaches to node i, where k_i is the current degree of node i and the sum runs over all existing nodes.

This means there is a linear relation between the probability that a node will be selected by an incoming node and its current relative degree. If we repeat this preferential process twice more, we may arrive at a network such as:
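Written as a growth loop, each incoming node picks an existing node with probability proportional to its current degree; a minimal sketch (the node labels and number of steps are illustrative):

```python
import random
from collections import Counter

random.seed(3)
degree = Counter({"A": 1, "B": 1})   # start with the link A-B: each node has one end

for i in range(1, 11):               # attach ten new nodes, one at a time
    nodes = list(degree)
    weights = [degree[n] for n in nodes]          # selection probability ~ current degree
    target = random.choices(nodes, weights=weights, k=1)[0]
    new_node = f"N{i}"
    degree[target] += 1              # the chosen node gains a link
    degree[new_node] = 1             # the newcomer starts with one link

print(degree.most_common())   # early, well-connected nodes tend to accumulate links
```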

This network has relative degree probabilities of:

It is this core decision-making process that drives the heterogeneity of the scale-free network, creates the hub topology, and ultimately creates the heteroskedasticity in the econometric regression model.

Now that we have discovered the network-topological basis for our heteroskedasticity, we can revisit our model using these properties. We will again use a regression, but this time we will keep the scale we discovered in the power-law distribution. When we account for this logarithmic scaling, our estimates change. Let us consider our five profiles again:

Profile   # of Likes   Log(# of Likes)
A         10           1
B         100          2
C         1000         3
D         10000        4
E         100000       5

If we now take the average of our logarithmic data, the total sum is 15 and the average is 3. Notice that the median is also 3 in this case. If we then take the anti-log of this mean, we obtain 10^3 = 1000, which is precisely the median of our untransformed data. We can confirm this multiplicative relationship with the geometric mean, (the product of the values)^(1/N), which is also 1000. With the logarithmically transformed data, the ‘average’ profile is once again related to the median and truly represents a typical profile. Let us now consider our regression once more, this time with logarithmically transformed variables:
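Checking the arithmetic:

```python
import math
from statistics import mean, median

likes = [10, 100, 1_000, 10_000, 100_000]

log_likes = [math.log10(x) for x in likes]
print(mean(log_likes))              # 3.0 -- the mean of the logarithms
print(10 ** mean(log_likes))        # ~1000 -- the anti-log of that mean
print(math.prod(likes) ** (1 / 5))  # ~1000 -- the geometric mean
print(median(likes))                # 1000 -- the median of the raw data
```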

We now obtain a proper benchmarking regression which accounts for the scaling of both the audience and the engagement. The interpretation of logarithms is one of multiplication and ratios: in a log-log regression, a 1% increase in X produces a B% change in Y, so the slope is a measure of the elasticity of engagement with respect to audience. Comparing each profile against the residuals around this line creates a valid benchmark for analysis, and we obtain a better coefficient of determination and an actual trend against which to compare. The engagement rate is arbitrary without logarithmic consideration, because of the heteroskedasticity produced by the natural bias in the network’s selective growth. A logarithmic correction is the most appropriate way to restore Ordinary Least Squares as the Best Linear Unbiased Estimator.
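In practice this amounts to regressing the logarithm of engagement on the logarithm of audience size; a sketch with synthetic placeholder data standing in for the real sample:

```python
import numpy as np
import statsmodels.api as sm

# Placeholder sample: swap in the real audience sizes and like counts.
rng = np.random.default_rng(1)
audience = (rng.pareto(2.5, size=5_000) + 1) * 100
likes = 0.01 * audience * rng.lognormal(0, 0.5, size=5_000)

X = sm.add_constant(np.log10(audience))
fit = sm.OLS(np.log10(likes), X).fit()

# The slope is an elasticity: a 1% increase in audience is associated with
# roughly a slope-% change in likes. Profiles whose residuals sit above the
# line are engaging better than the benchmark for their size.
print(fit.params, fit.rsquared)
```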