How much information should I collect in onboarding?

Some simple economics


When building a website, it's important to know how much information to collect from customers. There's a simple tradeoff: Ask too little, and you can't serve your customers well. Ask too much, and customers will not want to do business with you. The tradeoff is even more acute if your company ships physical goods and bears the cost of doing so. In that case, the marginal cost of shipping and of the goods themselves makes it doubly important to ask for information when a customer signs up (this is called onboarding). Sending a shipment to a customer you know nothing about would waste money.

Here, I build a simple economic model that derives optimality conditions and implications in a scenario where the company's only choice is to decide how much information to collect during onboarding.


We design the sign-up process for a website. We have one control variable, the amount of information we collect from the user. You can think of it as the time it takes to complete the onboarding. Users do not like to spend time on onboarding, so the more information we ask from them, the lower the probability that users sign up.

If a user converts, the user generates revenue (through purchases), but this revenue also comes with a fixed per-customer cost. You can think of this cost as the shipping cost plus the cost of buying inventory. I model all this in a simple way: Assume that there is exactly one shipment, with a shipping cost of \(c\) and revenue of \(r\). Revenue increases in the information collected because, with more information, it's easier to serve the customer's needs.

Rather than modeling the control variable directly, we simplify (without loss of generality) and model only the relation between the revenue \(r\) and the conversion probability \(p(r)\).

This brings us to the value function of

\[ V(r) = p(r)[r - c] \]

As laid out in the text, we are assuming \(p'(r)<0\). We are also assuming that the standard condition of \(p''(r)>0\) holds, which ensures convexity and thus a unique global optimum of the objective function.

Here, I could go on by deriving a first-order condition and stating conditions about the optimal amount of information required. It turns out that this isn't necessary: Maximizing this value function is formally identical to a firm's profit-maximization problem, which has a well-known and beautiful solution, the markup rule:

\[ r^* = c \cdot \frac{1}{1 - \frac{1}{\epsilon}} \]

where \(\epsilon\) is the elasticity of demand (in absolute value), which describes the percentage change in the conversion rate \(p\) for each percentage change in revenue \(r\).
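For readers who want to see where the rule comes from, here is a short sketch of the first-order condition. It assumes that \(p\) is differentiable, that the optimum is interior, and that \(\epsilon = -r\,p'(r)/p(r)\) is greater than one; these assumptions are mine and are only meant to make the step explicit.

\[ \frac{d}{dr}\Big(p(r)[r - c]\Big) = p'(r)\,(r - c) + p(r) = 0 \quad\Longrightarrow\quad \frac{r - c}{r} = -\frac{p(r)}{r\,p'(r)} = \frac{1}{\epsilon} \]

Solving \(r\,(1 - 1/\epsilon) = c\) for \(r\) gives the markup rule stated above.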

So the revenue from the optimal amount of information is a scaled version of the marginal cost. The higher the elasticity, the closer the revenue to the marginal cost, and thus the less information collected. This makes sense: If your customers hate that you ask them questions, then you should ask fewer questions.
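As a quick worked example with made-up elasticities (the numbers are illustrative only), the markup factor \(1/(1 - 1/\epsilon)\) shrinks toward one as \(\epsilon\) grows:

\[ \epsilon = 2:\; r^* = 2c \qquad \epsilon = 5:\; r^* = 1.25\,c \qquad \epsilon = 11:\; r^* = 1.1\,c \]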

From this equation, we get one direct implication: The higher the marginal cost, the more information you should collect from your customers. Let's say that your marginal cost increases. The economist's first response is that optimal prices should increase. The model above has the less obvious implication that you should also be more aggressive in collecting information from your customers.
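To make the comparative static concrete, here is a minimal numerical sketch. It assumes a hypothetical constant-elasticity conversion curve \(p(r) = r^{-\epsilon}\) on \(r \ge 1\) (so that \(p\) stays a valid probability) and made-up values for \(c\) and \(\epsilon\); none of these come from the model above beyond its general form. The function names are illustrative.

```python
import numpy as np

def conversion(r, eps):
    """Hypothetical constant-elasticity conversion curve p(r) = r**(-eps)."""
    return r ** (-eps)

def optimal_revenue_grid(c, eps, r_grid):
    """Maximize V(r) = p(r) * (r - c) by brute force over a grid of r values."""
    values = conversion(r_grid, eps) * (r_grid - c)
    return r_grid[np.argmax(values)]

def markup_rule(c, eps):
    """Closed-form optimum r* = c / (1 - 1/eps), valid for eps > 1."""
    return c / (1.0 - 1.0 / eps)

# Grid restricted to r >= 1 so that p(r) <= 1 (an assumption of this sketch).
r_grid = np.linspace(1.0, 20.0, 190_001)
eps = 3.0  # assumed elasticity (absolute value)

for c in (2.0, 4.0):  # assumed marginal costs
    numeric = optimal_revenue_grid(c, eps, r_grid)
    closed_form = markup_rule(c, eps)
    print(f"c={c}: grid optimum={numeric:.4f}, markup rule={closed_form:.4f}")
```

Under these assumptions, the grid optimum should agree with the markup rule up to the grid spacing, and doubling \(c\) doubles the optimal revenue.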

The markup rule can also be used to compare industries. For instance, compare a software company, with negligible marginal costs, with a company that ships physical goods. Because the software company has much lower marginal costs for achieving a certain revenue, we'd expect it to ask its customers for much less information. The equation is neat because it predicts not only the direction, but also the magnitude of the effect.

What if some users have higher revenue than others? Let's model that by allowing for a user-specific revenue factor \(f\):

\[ V(r) = p(r)[r\cdot f - c] \]

For intuition, it's useful to think of \(f\) as a multiplier that is one on average. With this new objective function, the optimal value of \(r\) is

\[ r^* = \frac{c}{f} \cdot \frac{1}{1 - \frac{1}{\epsilon}} \]

Since there is a one-to-one relationship between \(r\) and the information we ask for, this implies that we need to ask less valuable customers for more information. However, this equation has an even stronger implication: The customer's total revenue, \(r^*f\), is independent of \(f\), assuming we are holding the elasticity constant as we are changing \(f\).
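The independence result can be checked numerically. The sketch below again assumes a hypothetical constant-elasticity conversion curve \(p(r) = r^{-\epsilon}\) on \(r \ge 1\), with made-up values for \(c\), \(\epsilon\), and \(f\); it is an illustration under those assumptions, not a general proof.

```python
import numpy as np

def best_r(f, c, eps, r_grid):
    """Grid-maximize V(r) = r**(-eps) * (r * f - c) for a given revenue factor f."""
    values = r_grid ** (-eps) * (r_grid * f - c)
    return r_grid[np.argmax(values)]

c, eps = 2.0, 3.0  # assumed marginal cost and elasticity
r_grid = np.linspace(1.0, 20.0, 190_001)  # r >= 1 keeps p(r) <= 1

# Optimal base revenue r* for a low-, average-, and high-value customer.
results = {f: best_r(f, c, eps, r_grid) for f in (0.5, 1.0, 2.0)}

for f, r_star in results.items():
    print(f"f={f}: r*={r_star:.3f}, total revenue r* f={r_star * f:.3f}")
```

With these inputs, \(r^*\) falls as \(f\) rises, while the product \(r^* f\) stays at roughly \(c/(1 - 1/\epsilon)\) for every \(f\).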

This is a signal that the formulation I chose might be too simplistic and potentially misleading. As noted above, I started the model by simplifying: Instead of modeling the control variable \(x\), the information to be collected from the customer, I instead modeled the relation between the user revenue and the conversion probability. To understand the mechanisms better, it might help to directly model \(x\) and write:

\[ V(x) = p(x)[r(x)\cdot f - c] \]

If I pick the specification of

\[ p(x) = p_0 + c_p e^{-b_p \ln(x)}, \qquad r(x) = r_0 + c_r e^{b_r \ln(x)} \]

then I can derive a couple of special cases. For instance, if \(p_0=0=r_0\), then both \(p\) and \(r\) have a constant elasticity, and the markup rule holds globally. However, things look different when the constants are nonzero: Usually, we obtain that the optimal revenue will increase in user quality \(f\). For instance, consider a case in which the exponent in \(p(x)\) is large. In that case, it would never be reasonable to have equal revenue for high-value and low-value customers, because it's simply too hard to move low-value customers away from their default revenue of \(r_0\).
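Here is a minimal sketch for experimenting with this specification numerically. All parameter values are made up, and the function names are illustrative. With the defaults \(p_0 = r_0 = 0\) it reproduces the constant-elasticity special case, where the elasticity of \(p\) with respect to \(r\) is \(b_p / b_r\). The constants can be changed to explore other cases, but the sketch makes no claim about what those cases show, and it is up to the reader to keep \(p(x)\) between zero and one on the chosen grid.

```python
import numpy as np

def p(x, p0, cp, bp):
    """Conversion probability p(x) = p0 + cp * exp(-bp * ln(x)) = p0 + cp * x**(-bp)."""
    return p0 + cp * x ** (-bp)

def r(x, r0, cr, br):
    """Base revenue r(x) = r0 + cr * exp(br * ln(x)) = r0 + cr * x**br."""
    return r0 + cr * x ** br

def optimum(f, c, x_grid, p0=0.0, cp=1.0, bp=2.0, r0=0.0, cr=1.0, br=0.5):
    """Grid-maximize V(x) = p(x) * (r(x) * f - c).

    Returns the maximizing x and the total revenue r(x*) * f.
    All default parameter values are assumptions of this sketch.
    """
    value = p(x_grid, p0, cp, bp) * (r(x_grid, r0, cr, br) * f - c)
    x_star = x_grid[np.argmax(value)]
    return x_star, r(x_star, r0, cr, br) * f

# x >= 1 keeps p(x) <= 1 under the default parameters.
x_grid = np.linspace(1.0, 100.0, 99_001)
c = 3.0  # assumed marginal cost

for f in (0.5, 1.0, 2.0):
    x_star, total = optimum(f, c, x_grid)
    print(f"f={f}: x*={x_star:.2f}, total revenue={total:.3f}")
```

In the default special case, the total revenue should come out at roughly \(c/(1 - b_r/b_p)\) for every \(f\), which is the markup rule with \(\epsilon = b_p/b_r\).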