We all use search engines such as Google, Bing, and DuckDuckGo to conduct all kinds of searches on the internet. In this regard, a key question is the following: Does the superior quality of search results from dominant firms like Google stem primarily from better algorithms or from access to larger volumes of user-generated data? This distinction has salient implications for competition policy, innovation incentives, and consumer welfare. Why? If algorithmic superiority explains dominance, then competitors can challenge incumbents by developing better technology. However, if access to vast amounts of user data is the key driver, then incumbents enjoy an enduring advantage: their data troves act as de facto entry barriers that diminish rivals’ incentives to innovate.
Search engines rely heavily on data generated by users. Every query, and every subsequent click, contributes to query logs, which help refine future search results. The challenge for researchers seeking to shed light on this “algorithm vs. data access” question lies in isolating the causal effect of data availability from that of algorithms. Usage data are proprietary, and the number of past searches is not exogenous, which complicates empirical analysis. To overcome these hurdles, Tobias Klein and his colleagues collaborated with Cliqz, a small German search engine, to conduct a controlled experiment that sheds valuable light on this question.
The study’s design involved fixing the Cliqz algorithm while systematically varying the amount of user-generated data it could use to generate results. Queries were categorized into five “buckets” by frequency—from the most common to the rarest—and a representative sample of 5,000 queries was used. Non-personalized search results from Cliqz, Google, and Bing were obtained for comparison. Human assessors, unaware of which engine generated the results, rated their quality using a 7-point Likert scale.
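The study’s own tooling is not public, but a minimal Python sketch can illustrate the bucketing step. The function names, the equal-size bucket cut-offs, and the even per-bucket sampling below are all my assumptions for illustration, not details taken from the paper.

```python
import random
from collections import Counter

def bucket_queries(query_log, n_buckets=5):
    """Group unique queries into frequency buckets, most common to rarest.

    query_log: iterable of raw query strings (hypothetical input format).
    Returns a list of n_buckets lists of queries.
    """
    counts = Counter(query_log)
    ranked = [q for q, _ in counts.most_common()]  # most frequent first
    # Equal-size buckets are one simple choice; the study's actual
    # frequency cut-offs are not specified here.
    size = max(1, len(ranked) // n_buckets)
    return [
        ranked[i * size:(i + 1) * size] if i < n_buckets - 1
        else ranked[i * size:]  # last bucket absorbs the remainder
        for i in range(n_buckets)
    ]

def sample_queries(buckets, n_total=5000, seed=0):
    """Draw a sample split evenly across buckets (an assumption)."""
    rng = random.Random(seed)
    per_bucket = n_total // len(buckets)
    return [rng.sample(b, min(per_bucket, len(b))) for b in buckets]
```

The exact split is beside the point; what matters is that each query lands in a frequency tier before its blinded results are rated.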
The findings reveal that for popular queries, Cliqz’s results were comparable in quality to Google’s and Bing’s, suggesting that algorithmic differences are not the main determinant of search performance in these cases. However, for rare queries, which constitute 74% of total traffic, Cliqz’s results were significantly worse. Because the algorithm was held fixed, this gap can only be attributed to data: rare queries appear too seldom in a small engine’s query logs for it to learn from them. Access to more user-generated data is thus crucial for improving the quality of results, particularly for less frequent searches.
The experiment further demonstrated that diminishing returns set in quickly for popular queries: once a moderate amount of data is available, additional data no longer improves quality. In contrast, rare queries exhibited no such saturation, with quality continuing to improve as more data were added. A robustness check comparing the overlap between Cliqz’s and Google’s top five results (a simple version of this metric is sketched below) confirmed that rare queries require more data before their quality converges.
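As a rough illustration of that robustness check, here is one straightforward way to compute top-five overlap between two engines’ ranked result lists. The function and the sample URLs are hypothetical, not taken from the paper.

```python
def top_k_overlap(results_a, results_b, k=5):
    """Fraction of the top-k URLs shared by two rankings (0 to 1)."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    return len(top_a & top_b) / k

# Toy example: three of the five top results coincide, so overlap is 0.6.
cliqz_top = ["url1", "url2", "url3", "url4", "url5"]
google_top = ["url1", "url3", "url9", "url5", "url8"]
print(top_k_overlap(cliqz_top, google_top))  # 0.6
```

On the study’s account, this overlap should climb toward 1 as the smaller engine accumulates data, and it climbs more slowly for rare queries than for popular ones.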
These results have significant policy implications. Since rare queries dominate traffic, new entrants without sufficient user data face a considerable disadvantage, which entrenches the status quo and potential monopoly power. This dynamic also supports arguments for mandatory data sharing by incumbent firms, as envisioned by the European Union’s Digital Markets Act. Because user data are non-rival, sharing would not reduce incumbents’ quality but would erode their exclusivity advantage, potentially restoring competition and incentives to innovate.
The research I am discussing here is valuable because it provides rare experimental evidence on the causal link between user data and search quality. While prior research has established correlations between data scale and performance, this study demonstrates directly that user-generated data are indispensable for competing with incumbents. In addition, the findings underscore the salient point that the biggest quality gaps between dominant and smaller search engines arise not from algorithms but from disparities in access to data.
In sum, without data sharing, market entry will remain difficult, reducing competition and innovation. Even as emerging technologies like large language models improve algorithms, the need for sufficient user data persists. Put differently, effective regulation ensuring data access can help level the playing field and thereby enhance consumer welfare in digital search markets.
Batabyal is a Distinguished Professor, the Arthur J. Gosnell Professor of Economics, and the Head of the Sustainability Department, all at the Rochester Institute of Technology, but these views are his own.