Are you ready? Here is all the data Facebook and Google have on you

Google offers an option to download all of the data it stores about you. I’ve requested to download it and the file is 5.5GB big, which is roughly 3m Word documents. This link includes your bookmarks, emails, contacts, your Google Drive files, all of the above information, your YouTube videos, the photos you’ve taken on your phone, the businesses you’ve bought from, the products you’ve bought through Google.

They also have data from your calendar, your Google hangout sessions, your location history, the music you listen to, the Google books you’ve purchased, the Google groups you’re in, the websites you’ve created, the phones you’ve owned, the pages you’ve shared, how many steps you walk in a day…

Curran, D. (2018). Are you ready? Here is all the data Facebook and Google have on you.

I’ve been thinking about all the reasons that support my decision to move as much of my digital life as possible onto platforms and services that give me more control over how my personal data is used. Posts like this are really just reminders of what to include, and why I’m doing this. It’s not easy to move away from Google, Facebook, Amazon, Apple and Twitter, but it may just be worth it.

Delete All Your Apps

A good question to ask yourself when evaluating your apps is “why does this app exist?” If it exists because it costs money to buy, or because it’s the free app extension of a service that costs money, then it is more likely to be able to sustain itself without harvesting and selling your data. If it’s a free app that exists for the sole purpose of amassing a large amount of users, then chances are it has been monetized by selling data to advertisers.

Koebler, J. (2018). Delete all your apps.

This is a useful heuristic for making quick decisions about whether or not you should have that app installed on your phone. Another good rule of thumb: “If you’re not paying for the product then you are the product.” Your personal data is worth a lot to companies that will either use it to refine their own AI-based platforms (e.g. Google, Facebook, Twitter, etc.) or sell your (supposedly anonymised) data to those companies. This is how things work now…you give them your data (connections, preferences, brand loyalty, relationships, etc.) and they give you a service “for free”. But as we’re seeing more and more, it really isn’t free. This is especially concerning when you realise how often your device and apps are “phoning home” with reports about you and your usage patterns, sometimes as frequently as every 2 seconds.

On a related note, if you’re interested in a potential technical solution to this problem you may want to check out Solid (social linked data) by Tim Berners-Lee, which will allow you to maintain control of your personal information but still share it with 3rd parties under conditions that you specify.

Split learning for health: Distributed deep learning without sharing raw patient data

Can health entities collaboratively train deep learning models without sharing sensitive raw data? This paper proposes several configurations of a distributed deep learning method called SplitNN to facilitate such collaborations. SplitNN does not share raw data or model details with collaborating institutions. The proposed configurations of splitNN cater to practical settings of i) entities holding different modalities of patient data, ii) centralized and local health entities collaborating on multiple tasks…

Source: [1812.00564] Split learning for health: Distributed deep learning without sharing raw patient data

The paper describes how algorithm design (including training) can be shared across different organisations without any one of them having access to the others’ raw data or resources.

This has important implications for the development of AI-based health applications, in that hospitals and other service providers need not share raw patient data with companies like Google/DeepMind. Health organisations could do the basic algorithm design in-house on their smaller, local data sets and then send the partially trained model to organisations that have the massive data sets necessary for refining it, all without exposing the original data, thereby protecting patient privacy.
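To make the mechanics concrete, here is a minimal sketch of the split learning idea in PyTorch. The layer sizes, the single cut point and the toy training step are my own illustrative choices rather than any of the paper’s configurations; the point is simply that only the activations at the cut layer, and the gradient flowing back to it, ever cross the organisational boundary.

```python
# Minimal SplitNN-style training step (a sketch, assuming PyTorch; sizes are illustrative).
import torch
import torch.nn as nn

# Hospital-side model: runs locally on raw patient data, up to the "cut" layer.
client_net = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
# Partner-side model: continues from the cut layer and produces predictions.
server_net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

client_opt = torch.optim.SGD(client_net.parameters(), lr=0.01)
server_opt = torch.optim.SGD(server_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 100)        # raw patient data: never leaves the hospital
y = torch.randint(0, 2, (8,))  # labels (some SplitNN configurations keep these local too)

# The hospital computes activations at the cut layer and shares only those.
cut_activations = client_net(x)
smashed = cut_activations.detach().requires_grad_()  # this is what crosses the wire

# The partner finishes the forward pass, computes the loss and backpropagates
# only as far as the cut layer.
loss = loss_fn(server_net(smashed), y)
server_opt.zero_grad()
loss.backward()
server_opt.step()

# The gradient at the cut layer is sent back; the hospital completes backprop locally.
client_opt.zero_grad()
cut_activations.backward(smashed.grad)
client_opt.step()
```

Neither party ever sees the other’s weights or raw data: in this sketch the hospital sends 64 activation values per example and receives 64 gradient values back.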

Medical data: who owns it and what can be done to it?

…most states in the US do not have law to confer specific ownership of medical data to patients, while others put the rights on hospitals and physicians. Of all, only New Hampshire allows patients to legally own their medical records.

Source: Medical data: who owns it and what can be done to it?

A short article that raises some interesting questions. My understanding is that the data belongs to the patient and the media on which the data is stored belongs to the hospital. For example, I own the data generated about my body but the paper folder or computer hard drive belongs to the hospital. That means I can ask the hospital to photocopy my medical folder and give me the copy (or to email me an exported XML data file from whatever EHR system they use) but I can’t take the folder home when I’m discharged.

Things are going to get interesting when AI-based systems are being trained en masse using historical medical records where patients did not give consent for their data to be used for algorithmic training. I believe that the GDPR goes some way towards addressing this issue by stating that, “healthcare providers do not have to seek prior permission from patients to use their data, as long as they observe the professional secrecy act to not identify patients at the individual level”.

Mozilla’s Common Voice project

Any high-quality speech-to-text engines require thousands of hours of voice data to train them, but publicly available voice data is very limited and the cost of commercial datasets is exorbitant. This prompted the question, how might we collect large quantities of voice data for Open Source machine learning?

Source: Branson, M. (2018). We’re intentionally designing open experiences, here’s why.

One of the big problems with the development of AI is that few organisations have the large, inclusive, diverse datasets that are necessary to reduce the inherent bias in algorithmic training. Mozilla’s Common Voice project is an attempt to create a large, multi-language dataset of human voices with which to train natural language AI.

This is why we built Common Voice. To tell the story of voice data and how it relates to the need for diversity and inclusivity in speech technology. To better enable this storytelling, we created a robot that users on our website would “teach” to understand human speech by speaking to it through reading sentences.

I think that voice and audio are probably going to be the next computer-user interface, so this is an important project to support if we want to make sure that Google, Facebook, Baidu and Tencent don’t have a monopoly on natural language processing. I see this project existing on the same continuum as OpenAI, which aims to ensure that “…AGI’s benefits are as widely and evenly distributed as possible.” Whatever you think about the possibility of AGI arriving anytime soon, I think it’s a good thing that people are working to ensure that the benefits of AI aren’t mediated by a few gatekeepers whose primary function is to increase shareholder value.

Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation. So we’ve launched Common Voice, a project to help make voice recognition open and accessible to everyone. Now you can donate your voice to help us build an open-source voice database that anyone can use to make innovative apps for devices and the web. Read a sentence to help machines learn how real people speak. Check the work of other contributors to improve the quality. It’s that simple!

The datasets are openly licensed and available for anyone to download and use, alongside other open language datasets that Mozilla links to on the page. This is an important project that everyone should consider contributing to. The interface is intuitive and makes it very easy to either submit your own voice or validate the recordings that other people have made. Why not give it a go?
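To give a rough idea of how approachable the released data is, here is a small sketch that reads the clip index from a downloaded Common Voice archive with pandas. The directory layout and the validated.tsv index with path and sentence columns match recent releases, but check the version you download, since file and column names can change.

```python
# A quick look at a downloaded Common Voice release (a sketch; paths and column
# names assume a recent release extracted to ./cv-corpus/en, so adjust to your version).
import pandas as pd

clips = pd.read_csv("cv-corpus/en/validated.tsv", sep="\t")

# Each row points at an mp3 clip and the sentence the contributor read aloud.
print(len(clips), "validated clips")
print(clips[["path", "sentence"]].head())

# Where contributors opted in, demographic fields hint at how diverse the dataset is.
if "gender" in clips.columns:
    print(clips["gender"].value_counts(dropna=False))
```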

Ontario is trying a wild experiment: Opening access to its residents’ health data

This has led companies interested in applying AI to healthcare to find different ways to scoop up as much data as possible. Google partnered with Stanford and Chicago university hospitals to collect 46 billion data points on patient visits. Verily, also owned by Google’s parent company Alphabet, is recruiting 10,000 people for its own long-term health studies. IBM has spent the last few years buying up health companies for their data, accumulating records on more than 300 million people.

Source: Gershgorn, D. (2018). Ontario is trying a wild experiment: Opening access to its residents’ health data.

I’ve pointed to this problem before; it’s important that we have patient data repositories that are secure and maintain patient privacy, but we also need to use that data to make better decisions about patient care. Just like any research project needs carefully managed (and accurate) data, so too will AI-based systems. At the moment, this gives a huge competitive advantage to companies like Google, which can afford to buy that data indirectly by acquiring smaller companies. But even that isn’t sustainable because there’s “no single place where all health data exists”.

This decision by the Ontario government seems to be a direct move against the current paradigm. By making patient data available via an API, researchers will be able to access only the data approved for specific uses by patients, and it can remain anonymous. They get the benefit of access to enormous caches of health-related information while patient privacy is simultaneously protected. Of course, there are challenges that will need to be addressed, including issues around security, governance, and differing levels of access permissions.
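To illustrate what consent-scoped access could look like in practice, here is a purely hypothetical sketch. The article doesn’t document an actual interface, so the endpoint, token and field names below are invented for illustration only.

```python
# Hypothetical consent-scoped query (a sketch; the URL, token and fields are invented
# for illustration and are not Ontario's actual API).
import requests

API = "https://example-health-api.ontario.ca/v1"  # hypothetical base URL
TOKEN = "research-project-1234"                   # hypothetical project-scoped token

# The token encodes what this research project has been approved to see, so the same
# query only ever returns cohorts and fields covered by patient consent.
resp = requests.get(
    f"{API}/records",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cohort": "diabetes-2015-2018", "fields": "age_band,hba1c,medications"},
)
resp.raise_for_status()

for record in resp.json()["records"]:
    # Records arrive de-identified: no names or health card numbers, just the
    # approved, anonymised fields.
    print(record["age_band"], record["hba1c"])
```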

And those are just the technical issues (a big problem, since medical software is often poorly designed). They don’t take into account the ethics of making decisions about individual patients based on aggregate data. For example, if an algorithm suggests that other patients who look like Bob tend not to follow medical advice and default on treatment, should medical insurers deny Bob coverage? These and many other issues will need to be resolved before AI in healthcare can really take off.

How to ensure safety for medical artificial intelligence

When we think of AI, we are naturally drawn to its power to transform diagnosis and treatment planning and weigh up its potential by comparing AI capabilities to those of humans. We have yet, however, to look at AI seriously through the lens of patient safety. What new risks do these technologies bring to patients, alongside their obvious potential for benefit? Further, how do we mitigate these risks once we identify them, so we can all have confidence the AI is helping and not hindering patient care?

Source: Coiera, E. (2018). How to ensure safety for medical artificial intelligence.

Enrico Coiera covers a lot of ground (albeit briefly) in this short post:

  • The prevalence of medical error as a cause of patient harm
  • The challenges and ethical concerns that are inherent in AI-based decision-making around end-of-life care
  • The importance of high-quality training data for machine learning algorithms
  • Related to this, the challenge of poor (human) practice being encoded into algorithms and so perpetuated
  • The risk of becoming overly reliant on AI-based decisions
  • Limited transferability when technological solutions are implemented in different contexts
  • The importance of starting with patient safety in algorithm design, rather than adding it later

Each of the points in the summary above gives you enough of a foundation to really get to grips with some of the most interesting and challenging areas of machine learning in clinical practice. It might even be a useful guide to building an outline for a pretty comprehensive research project.

For more thoughts on developing a research agenda in related topics, see: AMA passes first policy guidelines on augmented intelligence.

Note: you should check out Enrico’s Twitter feed, which is a goldmine for cool (but appropriately restrained) ideas around machine learning in clinical practice.

Fairness matters: Promoting pride and respect with AI

We’re creating an open dataset that collects diverse statements from the LGBTIQ+ community, such as “I’m gay and I’m proud to be out” or “I’m a fit, happy lesbian that has just retired from a wonderful career” to help reclaim positive identity labels. These statements from the LGBTIQ+ community and their supporters will be made available in an open dataset, which coders, developers and technologists all over the world can use to help teach machine learning models how the LGBTIQ+ community speak about ourselves.

Source: Fairness matters: Promoting pride and respect with AI

It’s easy to say that algorithms are biased, because they are. It’s much harder to ask why they’re biased. They’re biased for many reasons, but one of the biggest contributors is that we simply don’t have diverse and inclusive data sets to train them on. Human bias and prejudice are reflected in our online interactions: the way we speak to each other on social media, the things we write about on blogs, the videos we watch on YouTube, the stories we share and promote. Project Respect is an attempt to increase the set of inclusive and diverse training data for better and less biased machine learning.

Algorithms are biased because human beings are biased, and the ways that those biases are reflected back to us may be why we find them so offensive. Maybe we don’t like machine bias because of what it says about us.
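The effect of an unrepresentative training set is easy to demonstrate on toy data. The sketch below uses entirely synthetic data (nothing to do with Project Respect’s dataset): a group that is under-represented at training time ends up with markedly worse accuracy, even though the classifier itself is completely standard.

```python
# Toy illustration of bias from an unbalanced training set (synthetic data only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, label_feature):
    """Synthetic group: the 'true' label depends on a different feature per group."""
    X = rng.normal(size=(n, 2))
    y = (X[:, label_feature] > 0).astype(int)
    return X, y

# The training data over-represents group A and barely includes group B.
Xa, ya = make_group(5000, 0)   # group A: label driven by feature 0
Xb, yb = make_group(100, 1)    # group B: label driven by feature 1
clf = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Fresh test samples from each group show the gap the imbalance creates.
Xa_test, ya_test = make_group(2000, 0)
Xb_test, yb_test = make_group(2000, 1)
print("accuracy on well-represented group A:", round(clf.score(Xa_test, ya_test), 3))
print("accuracy on under-represented group B:", round(clf.score(Xb_test, yb_test), 3))
```

The model isn’t malicious; it simply serves the patterns it was given, which is exactly why projects that broaden the training data matter.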

Separating the Art of Medicine from Artificial Intelligence

Writing a radiology report is an extreme form of data compression — you are converting around 2 megabytes of data into a few bytes, in effect performing lossy compression with a huge compressive ratio.

Source: Separating the Art of Medicine from Artificial Intelligence

For me, there were a few useful takeaways from this article. The first is that data analysis and interpretation is essentially a data compression problem. The trick is to find a balance between throwing out information that isn’t useful and maintaining the relevant message during processing. Consider the patient interview, where you take 15-20 minutes of audio data (about 10-15 MB using mp3 compression) and convert it to about a page of text (a few kilobytes at most). The subjective decisions we make about what information to discard and what to highlight have a real impact on our final conclusions and management plans.
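The compression ratios are worth spelling out. Using the rough sizes above (and allowing a few hundred bytes for a short radiology report, rather than the quote’s literal “few bytes”), both cases are lossy compression on the order of thousands to one:

```python
# Back-of-the-envelope compression ratios; the sizes are the rough figures from the
# text, not measurements.
chest_xray_bytes = 2 * 1024**2        # ~2 MB digital chest X-ray
report_bytes = 300                    # a short free-text report
print(f"radiology report: ~{chest_xray_bytes // report_bytes}:1")

interview_audio_bytes = 12 * 1024**2  # 15-20 minutes of mp3, roughly 10-15 MB
note_bytes = 3 * 1024                 # about a page of clinical notes
print(f"patient interview note: ~{interview_audio_bytes // note_bytes}:1")
```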

Human radiologists are so bad interpreting chest X-rays and/or agreeing what findings they can see, that the ‘report’ that comes with the digital image is often either entirely wrong, partially wrong, or omits information.

This is not just a problem in radiology. I haven’t looked for any evidence of this, but from personal experience I have little doubt that the inter- and intra-rater reliability of physiotherapy assessment is similarly low. Even in cases where the diagnosis and interventions are the same, there would likely be a lot of variation in how the report is described and formulated. This links to the last thing that I found thought-provoking:

…chest X-ray reports were never intended to be used for the development of radiology artificial intelligence. They were only ever supposed to be an opinion, an interpretation, a creative educated guess…A chest X-ray is neither the final diagnostic test nor the first, it is just one part of a suite of diagnostic steps in order to get to a clinical end-point.

We’re using unstructured medical data, captured in a variety of contexts, to train AI-based systems, but the data were never obtained, captured or stored in systems designed for that purpose. The implication is that the data we’re using to train medical AI simply isn’t fit for purpose. As long as we don’t collect the metadata (i.e. the contextual information “around” a condition), and continue using poorly labelled information and non-standardised language, we’re going to have problems with training machine learning algorithms. If we want AI-based systems to do anything more than basic triage then these are important problems to address.

You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information

We spend a lot of time focusing on the content of messaging systems as a means of identifying people, but it looks like the metadata encoded alongside the content may be just as important when it comes to de-anonymising the data. This wasn’t much of a problem in the past, because it’s hard for people to analyse multivariate relationships in large sets of data, especially when we don’t really know what we’re looking for. It turns out that machine learning algorithms are very good at finding patterns that we don’t have to explicitly define, which means we need to think carefully about what is included in the data we share.
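To get a feel for why metadata alone can be enough, here is a toy sketch with synthetic data (not the paper’s Twitter features or models): a stock classifier learns to tell “users” apart from nothing but habitual metadata such as posting hour, client app and follower count.

```python
# Identifying "users" from metadata alone (a sketch on synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_users, posts_per_user = 50, 200

rows, labels = [], []
for user in range(n_users):
    # Each "user" has habits: a typical posting hour, a favourite client app and a
    # fairly stable follower count. No message content is used anywhere.
    hour = int(rng.integers(0, 24))
    client = int(rng.integers(0, 5))
    followers = int(rng.integers(50, 5000))
    for _ in range(posts_per_user):
        rows.append([
            (hour + int(rng.integers(-2, 3))) % 24,                      # hour jitters a little
            client if rng.random() < 0.9 else int(rng.integers(0, 5)),   # occasionally another app
            followers + int(rng.integers(-20, 21)),                      # follower count drifts
        ])
        labels.append(user)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(rows), np.array(labels), test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("identification accuracy from metadata alone:", round(clf.score(X_test, y_test), 3))
```

On real platforms the available metadata is far richer than three toy fields, which helps explain why the paper reports such high accuracy even after substantial obfuscation.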

This may also have implications for the publication of data sets that researchers are under pressure to include in their final publications. How long before we need to ensure that metadata – as well as names – are scrubbed from the data sets?

We also found that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%. These results have strong implications in terms of the design of metadata obfuscation strategies, for example for data set release, not only for Twitter, but, more generally, for most social media platforms.

Source: [1803.10133v1] You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information