There are a number of overlapping reasons it is difficult to build large health data sets that are representative of our population. One is that the data is spread out across thousands of doctors’ offices and hospitals, many of which use different electronic health record systems. It’s hard to extract records from these systems, and that’s not an accident: The companies don’t want to make it easy for their customers to move their data to a competing provider. Miner, L. (2019). For a Longer, Healthier Life, Share Your Data. The New York Times.
The author goes on to talk about problems with HIPAA, which he suggests are the bigger obstacle to the large-scale data analysis that is necessary for machine learning. While I agree that HIPAA makes it difficult for companies to enable the sharing of health data while also complying with regulations, I don’t think it’s the main problem.
The requirements around HIPAA could change overnight through legislation. This would be challenging politically and legally, but it’s not hard to see how it could happen. There are well-understood processes for changing laws and regulations, and even though those processes are difficult, they are not conceptually hard to understand. But the ability to share data between EHRs will, I think, be a much bigger hurdle to overcome. There are incentives for the government to review the regulations around patient data in order to promote AI-in-healthcare initiatives; I can’t think of many incentives for companies to make it easier to port patient data between platforms. Unless the companies responsible for storing patient data make data portability and exchange a priority, I think it’s going to be very difficult to create large patient data sets.