We spend a lot of time focusing on the content of messaging systems as a means of identifying people, but it looks like the metadata encoded alongside that content may be just as important when it comes to de-anonymising users. This wasn't always a problem, because it's hard to analyse multivariate relationships in large data sets, especially when we don't really know what we're looking for. It turns out that machine learning algorithms are very good at finding patterns that we don't have to explicitly define, which means we need to think carefully about what is included in the data we share.
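To make the idea concrete, here is a minimal sketch of user identification framed as multi-class classification over per-message metadata alone. Everything here is illustrative: the data is synthetic, and the four metadata features (stand-ins for things like follower count or posting hour) are my own assumptions, not the paper's actual pipeline.

```python
# Sketch: user identification as multi-class classification over
# per-message metadata. Synthetic data; feature semantics are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

n_users, msgs_per_user = 50, 200
# Each "message" carries only metadata (e.g. follower count, friend count,
# posting hour, message length) -- no content at all. Each user's metadata
# clusters around account-specific values, which is what makes it a signature.
X = np.vstack([
    rng.normal(loc=rng.uniform(0, 10, size=4), scale=1.0,
               size=(msgs_per_user, 4))
    for _ in range(n_users)
])
y = np.repeat(np.arange(n_users), msgs_per_user)  # label = which user posted

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"identification accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```

Note that nothing here required hand-crafting a fingerprint: the classifier discovers the discriminating combinations of features on its own, which is exactly the point made above.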
This may also have implications for the data sets that researchers are under pressure to include with their final publications. How long before we need to ensure that metadata, as well as names, is scrubbed from those data sets?
We also found that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%. These results have strong implications for the design of metadata obfuscation strategies, for example for data set release, not only for Twitter but, more generally, for most social media platforms.
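Continuing the sketch above (and reusing X_train, y_train, X_test, and y_test from it), here is one naive way to test an obfuscation strategy: perturb a fraction of the training rows with Gaussian noise and re-measure identification accuracy. The perturbation scheme is my own illustration, not the one evaluated in the paper.

```python
# Sketch: naive obfuscation by adding noise to a fraction of training rows,
# then re-checking how well users can still be identified.
def perturb(X, fraction, scale=1.0, rng=None):
    """Return a copy of X with `fraction` of its rows perturbed by noise."""
    rng = rng or np.random.default_rng(1)
    X_out = X.copy()
    idx = rng.choice(len(X), size=int(fraction * len(X)), replace=False)
    X_out[idx] += rng.normal(scale=scale, size=(len(idx), X.shape[1]))
    return X_out

X_obf = perturb(X_train, fraction=0.6)  # perturb 60% of the training data
clf_obf = RandomForestClassifier(n_estimators=100, random_state=0)
clf_obf.fit(X_obf, y_train)
print(f"accuracy after 60% perturbation: "
      f"{accuracy_score(y_test, clf_obf.predict(X_test)):.2f}")
```

Intuitively, this kind of perturbation struggles because the unperturbed rows still anchor each user's cluster, and an ensemble classifier can largely shrug off the noisy minority of the signal.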