GDPR Compliance and Survivorship Bias: Implications for Learning Analytics Platforms
Abstract: In April 2016, the European Union adopted the General Data Protection Regulation (GDPR), a law intended to preserve individuals’ privacy with respect to their data and online activities. A consequence of this law is that firms have had to adjust their data use and data storage practices. In addition to requiring consent prior to the collection of user data, GDPR allows users to request that their data be removed from a company’s systems (Hintze & El Emam, 2017; Tesfay, Hofmann, Nakamura, Kiyomoto, & Serna, 2018). This is just one of many forms of legislative compliance that educational technology firms must address in designing products. Educational technology must also comply with CIPA (Children’s Internet Protection Act), COPPA (Children’s Online Privacy Protection Act), FERPA (Family Educational Rights and Privacy Act), and other country-specific regulations. These laws may have complex and unforeseen interactions, particularly as learning platforms collect increasingly granular data and use this data to fuel intelligent tutoring systems, artificial intelligence, recommendation engines, and other tools.
In this paper, we present mixed-methods results. In the first stage, we interviewed more than a dozen educational technology firms, comprising learning analytics platforms, language learning apps, educational game makers, and online course providers. We interviewed these firms about their use of AI in their learning platforms, their data storage practices, and their data removal practices. In the second stage, we simulated these data retention policies to understand how user self-deletion may bias estimates of an assessment’s difficulty as well as a student’s percentile rank.
We found that most firms relied on relatively simple methods to understand learner behavior, typically classical test theory. Relatively few firms used more advanced psychometric approaches. Where they did, these tools tracked a student’s knowledge state and tried to provide learning opportunities that were both challenging and within the student’s zone of proximal development. This technology required the platform designer to track not only a given student’s knowledge state but also item difficulty and measures of similarity to other students.
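As a hypothetical sketch (the function names and parameters here are illustrative, not drawn from any interviewed firm), the gap between the two approaches can be shown in a few lines: classical test theory needs only aggregate response counts per item, while knowledge-state tracking requires per-user state.

```python
# Classical test theory: item difficulty is simply the proportion of
# correct responses to the item; no per-user state is required.
def ctt_difficulty(responses):
    """responses: list of 0/1 correctness indicators for one item."""
    return sum(responses) / len(responses)


# A minimal per-user knowledge-state tracker, in the spirit of the more
# advanced platforms: keep a running mastery estimate per student and
# serve items whose difficulty sits near that estimate.
class KnowledgeTracker:
    def __init__(self, lr=0.1):
        self.mastery = {}  # user_id -> estimated mastery in [0, 1]
        self.lr = lr

    def update(self, user_id, correct):
        m = self.mastery.get(user_id, 0.5)
        self.mastery[user_id] = m + self.lr * (correct - m)

    def pick_item(self, user_id, item_difficulties):
        # Choose the item whose easiness best matches current mastery,
        # a crude stand-in for "zone of proximal development" targeting.
        m = self.mastery.get(user_id, 0.5)
        return min(item_difficulties,
                   key=lambda it: abs((1 - item_difficulties[it]) - m))
```

Note that the CTT estimate can be maintained as two counters (attempts, correct), whereas the tracker’s dictionary is exactly the kind of user-keyed data that deletion requests later remove.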
With respect to data storage practices, we identified two distinct strategies. A handful of providers reported that they kept no user-level data at all. These firms stated that their technology saw small or no gains from user-specific customization, and noted that tracking individual users would carry non-trivial compliance costs. Among firms that did track individual users, most keyed their data at that level of granularity. These firms also actively avoided collecting personally identifiable attributes such as race or gender, citing concerns that such fields could reveal potential bias in their products.
With respect to data removal practices, many firms stated that they erred on the side of hard deletion of user data upon request. Part of their rationale was that, with sufficiently rich user interaction data, individuals could easily be reidentified. A result of this hard-deletion strategy is that estimates of item difficulty and other features of the platform become subject to measurement error and survivorship bias. Notably, the deletion process often lacked its own reporting mechanism, so several of these firms could not gauge the extent of user self-deletion. Most consumer-focused firms expressed little concern over this issue, citing relatively low utilization of data removal. However, firms that held contracts with school districts worried that losing a specific district could bias estimates substantially.
Using this qualitative information, we then simulated data and systematically removed it under three distinct scenarios. In the first scenario, we assume that individual users self-delete over time. In the second, we perform a cross-country comparison in which one country allows users to self-delete. In the third, we assume that entire school districts can self-delete. Across all three scenarios, we find that adding noise to the data, computing aggregate features prior to data deletion, and using online learning algorithms can substantially mitigate the bias introduced by self-deletion (Mivule, 2013; Weng & Coad, 2018).
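The following toy simulation (hypothetical parameters, not the paper’s actual simulation code) illustrates the first scenario and one mitigation: when self-deletion correlates with ability, an item’s classical difficulty estimate drifts after deletion, while aggregate counts snapshotted before honoring deletions remain unbiased.

```python
import random

random.seed(0)

# Toy population: stronger students answer a fixed item correctly more
# often, and are also assumed (hypothetically) to be more likely to
# self-delete, e.g. after finishing the course.
n = 10_000
users = []
for _ in range(n):
    ability = random.random()                  # latent skill in [0, 1]
    correct = random.random() < ability        # response to one item
    deletes = random.random() < 0.6 * ability  # deletion tied to ability
    users.append((correct, deletes))

# Mitigation: snapshot aggregate counts *before* honoring deletion
# requests, then hard-delete the raw user-level records.
pre_total = n
pre_correct = sum(c for c, _ in users)
pre_rate = pre_correct / pre_total

# Naive recomputation from surviving records is biased downward,
# because the survivors skew toward lower-ability students.
survivors = [(c, d) for c, d in users if not d]
post_rate = sum(c for c, _ in survivors) / len(survivors)

print(f"pre-deletion correct rate:  {pre_rate:.3f}")
print(f"post-deletion correct rate: {post_rate:.3f}")
```

The pre-deletion snapshot contains no user-level records, so it can be retained after hard deletion; following Mivule (2013), noise could additionally be added to such aggregates before release for further privacy protection.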
Ultimately, this paper is intended as a guidepost for individuals building educational platforms on how to maintain strict privacy standards while preserving unbiased information about their platforms. This work also serves to remind academic researchers that data generated from educational platforms may be biased in unanticipated and undetectable ways.
Weng, R. C.-H., & Coad, D. S. (2018). Real-Time Bayesian Parameter Estimation for Item Response Models. Bayesian Analysis, 13(1), 115–137. https://doi.org/10.1214/16-BA1043
Hintze, M., & El Emam, K. (2017). Comparing the Benefits of Pseudonymization and Anonymization Under the GDPR. Journal of Data Protection & Privacy, 2(2). Retrieved from https://www.ingentaconnect.com/content/hsp/jdpp/2018/00000002/00000002/art00005
Mivule, K. (2013). Utilizing Noise Addition for Data Privacy, an Overview. Retrieved from http://arxiv.org/abs/1309.3958