Estimation issues: When the number of observations is an independent variable

Posted on Tue 28 May 2024 in Research

One of my coauthors, Xiaofeng Liu, and I were interested in understanding spillovers on contest platforms. While a theoretical model helped us ground the predictions, the actual empirical test turned out to be tricky to reason about (at least for me). Specifically, the model predicts that the number of observations is itself an independent variable. Below is a simplified version that shows the specific empirical challenge.

On each day, an agent is given a list of tasks to accomplish. The “quality” of the outcome of each task might depend on the full list of tasks that the agent has for that day. Specifically, if there were economies (of scope), you might imagine that the quality of the outcome of a given task would increase when there are many other tasks. In contrast, if there were diseconomies, the quality of the outcome of a given task might decrease when there are many other tasks.

Suppose that for \(T\) consecutive days, you can observe the quality of the outcomes of the tasks that were assigned to some agent. Specifically, suppose \(q_{it}\) is the quality of the outcome of task \(i\) on day \(t\), and the total number of tasks assigned to the agent on day \(t\) is \(n_{t}\). Our economies/diseconomies model above then says that \(q_{it}\) shifts by \(\theta n_{t}\), with \(\theta>0\) (\(\theta<0\)) when there are economies (diseconomies).

Given our observations \(q_{it}\) (and \(n_{t}\)), how can we estimate \(\theta\)?

First off, assume that \(\mathbf{N_{t}}\), the number of tasks given to the agent on a given day, has some distribution. Furthermore, suppose that the quality of work on task \(i\) on day \(t\) is given by \(\mathbf{Q_{it}}=\mathbf{Q}+\theta\mathbf{N_{t}}\), where \(\mathbf{Q}\) is a random variable drawn independently for each task. Then, the likelihood of observing the data is given by

$$\prod_{t=1}^{T}\left[\Pr\left(\mathbf{N_{t}}=n_{t}\right)\times\prod_{i=1}^{n_{t}}\Pr\left(\mathbf{Q}=-\theta n_{t}+q_{it}\right)\right]$$

Consequently, the negative log-likelihood is given by

$$-\sum_{t=1}^{T}\left(\ln\Pr\left(\mathbf{N_{t}}=n_{t}\right)+\sum_{i=1}^{n_{t}}\ln\Pr\left(\mathbf{Q}=-\theta n_{t}+q_{it}\right)\right)$$

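Assume that \(\mathbf{Q}\sim N\left(\mu,\sigma\right)\) and that the distribution of \(\mathbf{N_{t}}\) does not depend on \(\theta\) or \(\mu\). Then the \(\ln\Pr\left(\mathbf{N_{t}}=n_{t}\right)\) terms are constants, and each remaining term is, up to constants, the squared deviation \(\left(-\theta n_{t}+q_{it}-\mu\right)^{2}\) divided by \(2\sigma^{2}\). Minimizing the negative log-likelihood over \(\theta\) and \(\mu\) is therefore equivalent to minimizing \(\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\left(-\theta n_{t}+q_{it}-\mu\right)^{2}\). Furthermore, the latter is identical to a typical OLS regression of \(q_{it}\) on \(n_{t}\), with intercept \(\mu\) and slope \(\theta\)! Thus, at least in this setting where the number of observations is itself an independent variable, OLS will yield reasonable answers.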
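To make the argument concrete, here is a minimal simulation sketch of the setup above. It is not the analysis from our project: the uniform distribution for \(\mathbf{N_{t}}\), the parameter values, and the helper names `simulate_tasks` and `ols_fit` are all assumptions chosen for illustration. It pools every task-level observation and runs a plain OLS of \(q_{it}\) on \(n_{t}\).

```python
import random


def simulate_tasks(T, theta, mu, sigma, max_tasks, rng):
    """Return pooled (n_t, q_it) pairs, one per task, over T days."""
    rows = []
    for _ in range(T):
        # Assumption: N_t is uniform on 1..max_tasks and unrelated to theta.
        n_t = rng.randint(1, max_tasks)
        for _ in range(n_t):
            # Q_it = Q + theta * N_t, with Q ~ Normal(mu, sigma) drawn per task.
            rows.append((n_t, rng.gauss(mu, sigma) + theta * n_t))
    return rows


def ols_fit(rows):
    """OLS of q on n with an intercept; returns (mu_hat, theta_hat)."""
    m = len(rows)
    n_bar = sum(n for n, _ in rows) / m
    q_bar = sum(q for _, q in rows) / m
    sxx = sum((n - n_bar) ** 2 for n, _ in rows)
    sxy = sum((n - n_bar) * (q - q_bar) for n, q in rows)
    theta_hat = sxy / sxx
    return q_bar - theta_hat * n_bar, theta_hat


rng = random.Random(0)
rows = simulate_tasks(T=2000, theta=-0.3, mu=5.0, sigma=1.0, max_tasks=10, rng=rng)
mu_hat, theta_hat = ols_fit(rows)
print(f"mu_hat = {mu_hat:.3f}, theta_hat = {theta_hat:.3f}")
```

Days with many tasks contribute more rows to the regression, which is exactly the feature that made me nervous. In this sketch the estimates should still land close to the values used to generate the data.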

Note1: Obviously, the standard errors from OLS will be inaccurate for the model above. But bootstrap techniques might be useful for deriving the standard errors and confidence intervals of the estimates.
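One way the bootstrap idea in Note1 could look is to resample whole days rather than individual tasks, so that each resampled dataset keeps the link between \(n_{t}\) and the number of rows that day contributes. The sketch below is only an illustration under assumed values; the helper names (`simulate_days`, `theta_ols`, `day_bootstrap_ci`) and the choice of a percentile interval are mine, and I have not verified that this is the right procedure for the real problem.

```python
import random


def simulate_days(T, theta, mu, sigma, max_tasks, rng):
    """Return one list of task qualities per day; len(day) is n_t."""
    days = []
    for _ in range(T):
        n_t = rng.randint(1, max_tasks)  # assumed distribution of N_t
        days.append([rng.gauss(mu, sigma) + theta * n_t for _ in range(n_t)])
    return days


def theta_ols(days):
    """OLS slope of q_it on n_t, pooling all tasks from the given days."""
    rows = [(len(day), q) for day in days for q in day]
    m = len(rows)
    n_bar = sum(n for n, _ in rows) / m
    q_bar = sum(q for _, q in rows) / m
    sxx = sum((n - n_bar) ** 2 for n, _ in rows)
    sxy = sum((n - n_bar) * (q - q_bar) for n, q in rows)
    return sxy / sxx


def day_bootstrap_ci(days, B, rng):
    """Percentile 95% interval for theta, resampling days with replacement."""
    draws = sorted(
        theta_ols([rng.choice(days) for _ in days]) for _ in range(B)
    )
    return draws[int(0.025 * B)], draws[int(0.975 * B) - 1]


rng = random.Random(1)
days = simulate_days(T=300, theta=-0.3, mu=5.0, sigma=1.0, max_tasks=10, rng=rng)
theta_point = theta_ols(days)
ci_lo, ci_hi = day_bootstrap_ci(days, B=200, rng=rng)
print(f"theta_hat = {theta_point:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Resampling at the day level treats each day as the unit of observation, which seems natural here because \(n_{t}\) is fixed within a day.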

Note2: With serial correlation of \(\mathbf{Q_{it}}\), it appears that we can still obtain reasonable estimates using standard techniques; specifically, treating \(n_{t}\) as a typical independent variable should not bias our estimates. But I’m not a smart enough empirical researcher to think this through rigorously.
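A quick simulation can at least probe the claim in Note2 without proving it. The sketch below assumes one specific form of serial correlation that is not part of the model above: an AR(1) shock shared by all tasks on a day, drawn independently of \(n_{t}\). The function names and parameter values are hypothetical.

```python
import random


def simulate_ar1(T, theta, mu, sigma, rho, max_tasks, rng):
    """Pooled (n_t, q_it) pairs with an AR(1) day-level shock in quality."""
    rows, shock = [], 0.0
    for _ in range(T):
        # Assumption: shock_t = rho * shock_{t-1} + standard normal noise.
        shock = rho * shock + rng.gauss(0.0, 1.0)
        # Assumption: n_t is drawn independently of the shock.
        n_t = rng.randint(1, max_tasks)
        for _ in range(n_t):
            rows.append((n_t, mu + theta * n_t + shock + rng.gauss(0.0, sigma)))
    return rows


def ols_slope(rows):
    """OLS slope of q on n with an intercept."""
    m = len(rows)
    n_bar = sum(n for n, _ in rows) / m
    q_bar = sum(q for _, q in rows) / m
    sxx = sum((n - n_bar) ** 2 for n, _ in rows)
    sxy = sum((n - n_bar) * (q - q_bar) for n, q in rows)
    return sxy / sxx


rng = random.Random(2)
ar1_rows = simulate_ar1(
    T=3000, theta=-0.3, mu=5.0, sigma=1.0, rho=0.5, max_tasks=10, rng=rng
)
theta_hat_ar1 = ols_slope(ar1_rows)
print(f"theta_hat with AR(1) shocks = {theta_hat_ar1:.3f}")
```

If the serial correlation instead fed back into how many tasks are assigned, so that \(n_{t}\) depended on past quality, this check would no longer apply.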