The Effects of Big Data on Commercial Banks 

By Xiao Yin | December 5, 2022

The past decade has witnessed a revolution in the use of massive amounts of data for business decision-making. According to a report by Forbes, 2.5 quintillion bytes of data were created each day in 2018. In most cases, that volume is too great for people to process unaided. Thanks to advances in data storage and processing technology, business leaders have begun to use big data to extract hidden patterns in their customers’ behavior that humans cannot identify directly. The proliferation of big data in the banking sector is of particular interest because most banking activities rely on data analysis. Drawing on a quasi-experiment in China, I provide causal evidence on the effects of giving banks a large amount of firm-level operational information.


In 2014, local government agencies in China began sharing administrative data with commercial banks. The practice is intended to help banks reduce information asymmetry and identify better borrowers. Because of privacy concerns, many local government agencies contract with third-party financial service companies that act as a bridge between banks and governments. These companies help the commercial banks store and organize the data and assume legal responsibility for data security. I use information from the earliest and largest third-party data provider (the provider hereafter) to compare banks’ operating activities before and after the provider steps in. The provider began operating in 2014 and serves 44 commercial banks, with a market share of over 90% throughout the sample period.

The data shared with the banks include firms’ detailed balance sheet information, tax history, and ownership structure, as well as the credit history and legal history of the firms and their controlling shareholders. The data do not contain any variables that the banks could not obtain before the experiment. What the experiment changes is mainly the amount of information available. On average, the initial data provision gives each bank information on more than 250 thousand firms, covering hundreds of variables, and this information is continuously updated.

Big Data and Screening Ability 

Naturally, the first thing to look at is how bank screening ability changes after banks receive an extensive amount of firm information. The effects of the experiment are estimated by comparing banks that receive the data from the provider with banks that do not but are otherwise arguably identical. The results show that providing banks with a large amount of data significantly increases their screening ability. Screening ability is measured by explanatory power: the pseudo R-squared from a logistic regression that predicts borrowers’ default using the bank’s ex-ante proprietary credit score. For the treatment group (banks that receive the data), this pseudo R-squared is 49.92% larger in the post-experiment data than in the pre-experiment data. In contrast, for the control group (banks that do not receive the data), the pseudo R-squared changes by only -0.83% over the same period. The quasi-experiment also greatly affects the loan characteristics of the treated banks. Specifically, the experiment induces a 4% increase in average loan volume, a 0.47 percentage point increase in interest rates, a 2.97-month increase in loan maturity, and a 0.25 percentage point decrease in the default rate. These changes are economically large, and all are statistically significant at the 1% level.
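To make the screening measure concrete, here is a minimal sketch of how a pseudo R-squared from a logistic regression of default on a credit score could be computed and compared across periods. It assumes McFadden's definition of pseudo R-squared, which the post does not specify, and it uses made-up toy loan counts. The function names and the data are illustrative only and do not come from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(scores, defaults, max_iter=100, tol=1e-10):
    """Fit P(default) = sigmoid(a + b * score) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(scores, defaults):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b

def log_likelihood(a, b, scores, defaults):
    total = 0.0
    for x, y in zip(scores, defaults):
        p = sigmoid(a + b * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def mcfadden_r2(scores, defaults):
    """McFadden pseudo R-squared: 1 - LL(model) / LL(intercept only)."""
    a, b = fit_logit(scores, defaults)
    ll_model = log_likelihood(a, b, scores, defaults)
    base = sum(defaults) / len(defaults)
    ll_null = sum(math.log(base) if y == 1 else math.log(1.0 - base)
                  for y in defaults)
    return 1.0 - ll_model / ll_null

def expand(counts):
    """Turn {score: (n_defaults, n_loans)} into parallel score/default lists."""
    scores, defaults = [], []
    for score, (n_def, n_loans) in counts.items():
        scores += [score] * n_loans
        defaults += [1] * n_def + [0] * (n_loans - n_def)
    return scores, defaults

# Hypothetical toy data: the score separates defaulters better in the post period.
pre_scores, pre_defaults = expand({1: (3, 6), 2: (2, 6), 3: (2, 6), 4: (2, 6), 5: (1, 6)})
post_scores, post_defaults = expand({1: (5, 6), 2: (3, 6), 3: (1, 6), 4: (1, 6), 5: (0, 6)})

r2_pre = mcfadden_r2(pre_scores, pre_defaults)
r2_post = mcfadden_r2(post_scores, post_defaults)
relative_change = (r2_post - r2_pre) / r2_pre  # the "X% larger" comparison
print(f"pre: {r2_pre:.3f}  post: {r2_post:.3f}  relative change: {relative_change:.1%}")
```

Running the same comparison separately for treated and control banks gives the two relative changes that the post contrasts. The toy numbers here are not meant to reproduce the reported 49.92% and -0.83%.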

The experiment seems to have a very significant positive effect on the lenders. But does it affect all banks to the same degree? The answer depends on how we think about data versus technology. Big data enables lenders to extract more high-dimensional information through systematic statistical inference. However, the availability of statistical inference tools differs widely across banks with different levels of technology. Since big data is expected to be useful only to those who can use it efficiently, the larger amount of data made available by the experiment should have significant effects only for banks with high information technology (IT) capacity. Consistent with this conjecture, the effects of the experiment are almost entirely driven by banks with high IT intensity, defined as average IT spending scaled by total expenses in the five years before the experiment. Specifically, for treated banks with high IT intensity, the pseudo R-squared from predicting borrowers’ default using the bank’s proprietary credit score is 77.79% larger, while for those with low IT intensity it barely changes. The results are similar for loan characteristics. For banks with high IT intensity, average loan volume increases by 8%, interest rates increase by 0.51 percentage points, loan maturities increase by 4.09 months, and the default rate decreases by 0.47 percentage points. At the same time, there are no significant changes in the characteristics of loans originated by banks with low IT intensity.

How Should We Think about the Findings? 

The findings are expected in some dimensions but less so in others. For example, with more data, banks are expected to screen better on average and, in turn, to channel funds to borrowers with lower default rates. The reasoning is straightforward: before a shock increases bank screening ability, banks cannot distinguish some higher-risk borrowers from lower-risk ones. They therefore charge interest rates that are too high for low-risk borrowers but too low for high-risk borrowers. With better screening ability, affected banks can identify the good borrowers and direct more funds to them.

However, a more interesting result is the simultaneous increase in interest rates and decrease in the default rate. That is, despite a lower expected cost of lending, lenders on average charge a higher price for loans after the experiment. This is inconsistent with a perfectly competitive credit market. In such a market, lenders are expected to break even, so big data that reduces the average default rate should also reduce the average interest rate. Instead, the result suggests that big data decreases the competitiveness of the commercial lending market. Big data greatly improves screening ability, but only for some banks. For others, it is useless because they cannot use it efficiently, so these banks remain unaffected despite the availability of a massive amount of data. A direct consequence is that good borrowers at the unaffected banks turn to the affected banks, because the latter know that these borrowers have a lower default rate and will offer them a lower interest rate. In a market that is not perfectly competitive, the demand for funds increases at banks with high technology capacity. These banks then enjoy greater market power while cream-skimming good borrowers from the others. The findings support this hypothesis: after the experiment, more borrowers with lower perceived default risk end their current banking relationships and borrow from banks with higher IT capacity.
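The break-even argument can be written out in a simple one-period setting. The notation and assumptions below are introduced here for illustration and are not from the paper: a risk-neutral lender, default probability p, loan rate r, funding cost c, and zero recovery when the borrower defaults.

```latex
% Assumed notation (not from the post): p = default probability,
% r = loan interest rate, c = lender's funding cost, zero recovery on default.
\begin{align*}
(1 - p)(1 + r) &= 1 + c && \text{(zero expected profit)} \\
r &= \frac{1 + c}{1 - p} - 1 = \frac{c + p}{1 - p} \\
\frac{\partial r}{\partial p} &= \frac{1 + c}{(1 - p)^2} > 0
\end{align*}
```

Under these assumptions the break-even rate rises with default risk, so a fall in the average default rate should by itself push the competitive interest rate down. The observed pattern, a lower default rate together with a higher interest rate, moves in the opposite direction.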

The increase in demand that raises market power also creates externalities for borrowers whose perceived creditworthiness remains unchanged after the experiment. Specifically, the results show that borrowers who borrowed both before and after the experiment, with roughly unchanged perceived creditworthiness, also pay higher average interest rates afterward. Why don’t these borrowers look for alternative funding sources when facing a higher interest cost? One reason is the so-called winner’s curse. Thanks to the data provision program and their high technological capacity, the truly affected banks now enjoy an information advantage over other lenders. If another bank outbids an affected bank, it is most likely because the affected bank’s credit-scoring model indicates a higher default risk for that borrower. The uninformed banks are therefore cursed in the sense that they win the bid only when, unable to use the data as efficiently as the affected banks, they post an interest rate that is too low.
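The winner's curse logic can be illustrated with a stylized two-bank example. Everything in this sketch is an assumption made for illustration: the borrower pool, the 3% funding cost, zero recovery on default, and both banks quoting break-even rates. The post argues that informed banks in fact charge more than break-even, which the sketch ignores for simplicity.

```python
def break_even_rate(default_prob, funding_cost=0.03):
    """Rate r solving (1 - p)(1 + r) = 1 + c, assuming zero recovery on default."""
    return (1.0 + funding_cost) / (1.0 - default_prob) - 1.0

def expected_profit(rate, default_prob, funding_cost=0.03):
    """Expected profit per unit lent at the given rate."""
    return (1.0 - default_prob) * (1.0 + rate) - (1.0 + funding_cost)

# Hypothetical borrower pool: half low-risk, half high-risk (true default probabilities).
borrowers = [0.02] * 50 + [0.10] * 50

# The uninformed bank cannot tell the types apart and prices the pooled risk.
pooled_default = sum(borrowers) / len(borrowers)
uninformed_quote = break_even_rate(pooled_default)

# The informed bank prices each borrower's true risk; the borrower takes the lower quote.
uninformed_wins = []
for p in borrowers:
    informed_quote = break_even_rate(p)
    if uninformed_quote < informed_quote:
        uninformed_wins.append(p)

avg_default_won = sum(uninformed_wins) / len(uninformed_wins)
uninformed_profit = sum(
    expected_profit(uninformed_quote, p) for p in uninformed_wins
) / len(uninformed_wins)

print(f"pooled default rate priced by uninformed bank: {pooled_default:.3f}")
print(f"default rate among loans it actually wins:     {avg_default_won:.3f}")
print(f"its expected profit per unit lent:             {uninformed_profit:.4f}")
```

In this setup the uninformed bank wins only the high-risk borrowers, because those are the only ones for whom its pooled quote undercuts the informed bank, and it loses money on them in expectation. Anticipating this, an uninformed bank has little reason to bid aggressively, which is one way borrowers can end up with few outside options.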

About Banking Market Concentration 

The results show that big data has differential effects on banks, with larger positive effects on those with higher technological capacity, namely, those that can process the data better. Another angle is how big data affects banking market concentration. It is natural to expect that big banks invest more heavily in technology. For example, a report from The Financial Brand documents that 50% of banks with over $50 billion in assets have adopted a data analytics strategy, compared with only 9% of banks with less than $1 billion in assets. A conjecture, therefore, is that the data provision program increases market concentration by disproportionately benefiting big banks. Indeed, the data show that the loan Herfindahl-Hirschman Index (HHI) increases by 6.43% two years after the experiment in provinces where at least one bank receives the data. In comparison, in provinces where no bank receives the data, the loan HHI decreases by a mild 2.86%. This finding confirms that the availability of big data tends to increase banking market concentration.
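For readers unfamiliar with the measure, the HHI is the sum of squared market shares. The sketch below computes it on the common 0 to 10,000 scale from hypothetical loan volumes for four banks in one province. The volumes, the scale, and the function names are assumptions for illustration; the post does not describe how the loan HHI was constructed beyond its name.

```python
def hhi(loan_volumes):
    """Herfindahl-Hirschman Index: sum of squared percentage market shares."""
    total = sum(loan_volumes)
    if total <= 0:
        raise ValueError("total loan volume must be positive")
    return sum((100.0 * v / total) ** 2 for v in loan_volumes)

def pct_change(before, after):
    """Percentage change from before to after."""
    return 100.0 * (after - before) / before

# Hypothetical loan volumes for four banks in one province.
volumes_before = [30, 25, 25, 20]
volumes_after = [40, 25, 20, 15]   # the largest bank gains share

hhi_before = hhi(volumes_before)
hhi_after = hhi(volumes_after)
print(f"HHI before: {hhi_before:.0f}, after: {hhi_after:.0f}, "
      f"change: {pct_change(hhi_before, hhi_after):.2f}%")
```

Because shares are squared, loans shifting toward an already large bank raise the index, which is why a rise in the loan HHI is read as an increase in concentration.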

Xiao Yin is a doctoral student at the University of California, Berkeley Haas School of Business.

This post was adapted from his paper, “The Effects of Big Data on Commercial Banks,” available on SSRN.
