Are you ready? Here is all the data Facebook and Google have on you

Google offers an option to download all of the data it stores about you. I’ve requested to download it and the file is 5.5GB big, which is roughly 3m Word documents. This link includes your bookmarks, emails, contacts, your Google Drive files, all of the above information, your YouTube videos, the photos you’ve taken on your phone, the businesses you’ve bought from, the products you’ve bought through Google.

They also have data from your calendar, your Google hangout sessions, your location history, the music you listen to, the Google books you’ve purchased, the Google groups you’re in, the websites you’ve created, the phones you’ve owned, the pages you’ve shared, how many steps you walk in a day…

Source: Curran, D. (2018). Are you ready? Here is all the data Facebook and Google have on you.

I’ve been thinking about all the reasons that support my decision to move as much of my digital life as possible into platforms and services that give me more control over how my personal data is used. Posts like this are really just reminders of what to include, and of why I’m doing this. It’s not easy to move away from Google, Facebook, Amazon, Apple and Twitter, but it may just be worth it.

Split learning for health: Distributed deep learning without sharing raw patient data

Can health entities collaboratively train deep learning models without sharing sensitive raw data? This paper proposes several configurations of a distributed deep learning method called SplitNN to facilitate such collaborations. SplitNN does not share raw data or model details with collaborating institutions. The proposed configurations of splitNN cater to practical settings of i) entities holding different modalities of patient data, ii) centralized and local health entities collaborating on multiple task

Source: [1812.00564] Split learning for health: Distributed deep learning without sharing raw patient data

The paper describes how model design and training can be shared across different organisations without any of them having access to the others’ raw data or model details.

This has important implications for the development of AI-based health applications, in that hospitals and other service providers need not share raw patient data with companies like Google/DeepMind. Health organisations could do the basic algorithm design in-house with the smaller, local data sets and then send the algorithm to organisations that have the massive data sets necessary for refining the algorithm, all without exposing the initial data and protecting patient privacy.
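As a rough illustration of the mechanics, here is a minimal numpy sketch of the split-learning idea (a toy two-layer network, not the paper’s actual SplitNN configurations; all data and sizes are invented): the hospital computes activations up to the cut layer and shares only those, while the outside party computes the remaining layers and sends back only the gradient at the cut, so raw patient data never leaves the hospital.

```python
import numpy as np

def train_split(steps=200, lr=0.1, seed=0):
    """Toy split-learning loop: the hospital holds X, y and W1; the server holds W2."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 10))        # raw patient features (never shared)
    y = (X[:, :1] > 0).astype(float)     # invented toy label
    W1 = rng.normal(size=(10, 8)) * 0.1  # hospital-side layer (up to the cut)
    W2 = rng.normal(size=(8, 1)) * 0.1   # server-side layer (after the cut)
    losses = []
    for _ in range(steps):
        # Hospital: forward pass to the cut layer; only `a` crosses the wire.
        z = X @ W1
        a = np.maximum(z, 0)
        # Server: finish the forward pass and compute the loss.
        p = 1 / (1 + np.exp(-(a @ W2)))
        losses.append(-np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)))
        # Server: backprop its own layer; only `da` is sent back to the hospital.
        dp = (p - y) / len(y)            # gradient of cross-entropy w.r.t. the logit
        dW2, da = a.T @ dp, dp @ W2.T
        W2 -= lr * dW2
        # Hospital: finish backprop locally, using its raw data.
        W1 -= lr * (X.T @ (da * (z > 0)))
    return losses
```

The key point the sketch tries to capture is that neither party ever sees the other’s half: the server receives activations and returns gradients, nothing more.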

a16z Podcast: Revenge of the Algorithms (Over Data)… Go! No?

An interesting (and sane) conversation about the defeat of AlphaGo by AlphaGo Zero. It almost completely avoids the science-fiction-y media coverage that tends to emphasise the potential for artificial general intelligence and instead focuses on the following key points:

  • Go is a stupendously difficult board game for computers to play but it’s a game in which both players have total information and where the rules are relatively simple. This does not reflect the situation in any real-world decision-making scenario. Correspondingly, this is necessarily a very narrow definition of what an intelligent machine can do.
  • AlphaGo Zero represents an order of magnitude improvement in algorithmic modelling and power consumption. In other words, it does a lot more with a lot less.
  • Related to this, AlphaGo Zero started from scratch, with humans providing only the rules of the game. So Zero used reinforcement learning (rather than supervised learning) to figure out the same moves (and, in some cases, better ones) that human beings have developed over the last thousand years or so.
  • It’s an exciting achievement but shouldn’t be conflated with any significant step towards machine intelligence that transfers beyond highly constrained scenarios.
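To make the self-play idea concrete, here is a toy sketch in a similar spirit (tabular updates on the game of Nim, invented purely for illustration; AlphaGo Zero itself uses deep networks and Monte Carlo tree search): the program is given only the rules and a win/loss signal, plays randomly against itself, and ends up rediscovering the known optimal strategy of leaving the opponent a multiple of three.

```python
import random

def train_selfplay(episodes=20000, alpha=0.5, pile=10, seed=1):
    """Self-play value learning for Nim: take 1 or 2 stones; taking the last stone wins.

    Both sides share one table Q[(pile, action)], scored from the perspective of
    the player to move. Moves are random, so everything the table learns comes
    only from the rules and the win/loss outcome.
    """
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(1, pile + 1) for a in (1, 2) if a <= s}
    for _ in range(episodes):
        s = pile
        while s > 0:
            a = random.choice([1, 2] if s >= 2 else [1])
            nxt = s - a
            if nxt == 0:
                target = 1.0  # this move takes the last stone and wins
            else:
                # The opponent moves next, so our value is minus their best value.
                target = -max(Q[(nxt, b)] for b in (1, 2) if b <= nxt)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = nxt
    return Q
```

After training, the table prefers taking 1 from a pile of 4 and 2 from a pile of 5 (both leaving a multiple of three), and rates every move from a pile of 3 as losing, without anyone having told it that strategy.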

Here’s the abstract from the publication in Nature:

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

Ontario is trying a wild experiment: Opening access to its residents’ health data

This has led companies interested in applying AI to healthcare to find different ways to scoop up as much data as possible. Google partnered with Stanford and Chicago university hospitals to collect 46 billion data points on patient visits. Verily, also owned by Google’s parent company Alphabet, is recruiting 10,000 people for its own long-term health studies. IBM has spent the last few years buying up health companies for their data, accumulating records on more than 300 million people.

Source: Gershgorn, D. (2018). Ontario is trying a wild experiment: Opening access to its residents’ health data.

I’ve pointed to this problem before; it’s important that we have patient data repositories that are secure and maintain patient privacy, but we also need to use that data to make better decisions about patient care. Just as any research project needs carefully managed (and accurate) data, so too will AI-based systems. At the moment this gives a huge competitive advantage to companies like Google, which can afford to buy that data indirectly by acquiring smaller companies. But even that isn’t sustainable, because there’s “no single place where all health data exists”.

This decision by the Ontario government seems to be a direct move against the current paradigm. By making patient data available via an API, researchers will be able to access only the data that patients have approved for specific uses, and that data can remain anonymous. Researchers get the benefit of access to enormous caches of health-related information while patient privacy is simultaneously protected. Of course, there are challenges that will need to be addressed, including security, governance, and differing levels of access permissions.
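A hypothetical sketch of what consent-scoped, de-identified access through such an API could look like; the field names, purposes and overall shape are invented for illustration and have nothing to do with Ontario’s actual system.

```python
# Invented in-memory "repository"; a real system would sit behind an API
# with authentication, auditing, and proper de-identification.
RECORDS = [
    {"id": 1, "name": "A. Mokoena", "dob": "1971-03-02",
     "hba1c": 7.9, "consented_for": {"diabetes-research"}},
    {"id": 2, "name": "B. Naidoo", "dob": "1985-11-19",
     "hba1c": 5.4, "consented_for": set()},
]

def query(purpose):
    """Return de-identified records only for patients who approved this purpose."""
    out = []
    for rec in RECORDS:
        if purpose not in rec["consented_for"]:
            continue  # the patient did not approve this use of their data
        # Strip direct identifiers; `id` stands in for a pseudonymous key.
        out.append({k: v for k, v in rec.items()
                    if k not in ("name", "dob", "consented_for")})
    return out
```

The point of the sketch is the gatekeeping step: the consent check and the identifier-stripping happen before anything is returned, so researchers only ever see approved, anonymised rows.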

And those are just the technical issues (a big problem in themselves, since medical software is often poorly designed). They don’t take into account the ethics of making decisions about individual patients based on aggregate data. For example, if an algorithm suggests that other patients who look like Bob tend not to follow medical advice and to default on treatment, should medical insurers deny Bob coverage? These and many other issues will need to be resolved before AI in healthcare can really take off.

The Desperate Quest for Genomic Compression Algorithms

While it’s hard to anticipate all the future benefits of genomic data, we can already see one unavoidable challenge: the nearly inconceivable amount of digital storage involved. At present the cost of storing genomic data is still just a small part of a lab’s overall budget. But that cost is growing dramatically, far outpacing the decline in the price of storage hardware. Within the next five years, the cost of storing the genomes of billions of humans, animals, plants, and microorganisms will easily hit billions of dollars per year. And this data will need to be retained for decades, if not longer.

Source: Pavlichin, D. & Weissman, T. (2018). The Desperate Quest for Genomic Compression Algorithms.

Interesting article that gets into the technical details of compression technologies as a way of avoiding the storage problem that comes with the increasing digitalisation of healthcare information.
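One reason genomic data is such a promising target for compression is that a four-letter alphabet needs only two bits per base, so even naive bit-packing beats one-byte-per-character text by 4× before any real compression (reference-based methods or entropy coding) is applied. A toy sketch of that packing, not any actual genomic codec:

```python
BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 0b11])
    return "".join(seq[:n])
```

Real tools go much further (exploiting similarity to a reference genome, and the statistics of quality scores), but the 4× floor from the tiny alphabet is where the savings start.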

Associated with this (although not covered in this article) is the idea that we’re moving from a system in which data gathering and storage are emphasised (see any number of articles on the rise of Big Data) towards one in which data analysis must be the focus. Now that we have (or soon will have) all this data, what are we going to do with it? Unless we figure out how to use it to improve healthcare, it’s pretty useless.

Eagle-eyed machine learning algorithm outdoes human experts — ScienceDaily

“Human detection and identification is error-prone, inconsistent and inefficient. Perhaps most importantly, it’s not scalable,” says Morgan. “Newer imaging technologies are outstripping human capabilities to analyze the data we can produce.”

Source: Eagle-eyed machine learning algorithm outdoes human experts — ScienceDaily

The point here is that data is being generated faster than we can analyse and interpret it. Big data is not a storage problem, it’s an analysis problem. Yes, we’ve had large sets of data before (think, libraries) but no-one expected a human being to read through, and make sense of, all of it. Now that digital health-related data is being generated by institutions (e.g. CT and MRI scans, EHRs), wearables (e.g. Fitbits, smart contact lenses), embeddables (e.g. wifi-enabled pacemakers, insulin pumps) and ingestibles (e.g. bluetooth-enabled smart pills), it’s clear that no single service provider will have the cognitive capacity to analyse and interpret the data flowing from patients at that scale.

As more and more of the data we use in healthcare is digitised, we’ll need algorithmic assistance to filter out and highlight what is important for our specific context (i.e. what a physio needs to know, rather than what a nurse needs). There will obviously be a role for health professionals in designing and evaluating those algorithms, but will we be forward-thinking enough to clearly describe those roles and to prepare future clinicians for them?
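At its simplest, that kind of context filtering is just routing each incoming data point to the roles that declared it relevant. A hypothetical sketch, with the roles and categories invented for illustration:

```python
# Invented mapping from clinical role to the event categories it cares about.
RELEVANCE = {
    "physio": {"mobility", "pain"},
    "nurse": {"vitals", "medication", "pain"},
    "dietician": {"nutrition"},
}

def inbox(role, events):
    """Return only the events a given role has declared relevant."""
    wanted = RELEVANCE.get(role, set())
    return [e for e in events if e["category"] in wanted]

# Invented example events flowing in from various devices and records.
events = [
    {"category": "vitals", "detail": "BP 150/95"},
    {"category": "mobility", "detail": "walked 20 m with frame"},
    {"category": "nutrition", "detail": "low appetite"},
]
```

A real system would need clinicians to design and audit that relevance mapping, which is exactly the kind of role the paragraph above is asking whether we’re preparing people for.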

Medical images: the only photos not in the cloud – AI Med

We know that AI could prove highly beneficial for radiologists by cutting down on read times and improving accuracy. In addition, AI could be a strong resource for mining large data sets for both individual patient care and global insights. But first, we must access the images.

Today’s traditional hardware, CDs, and PACS (picture archiving communications system) lock data deep inside them and prevent interoperability.

Source: Medical images: the only photos not in the cloud – AI Med

I’d never considered this before but it’s obviously true, and for good reason. Patient anonymity and privacy are good reasons to lock down medical images. But it also means that we won’t be able to run machine learning algorithms on that data, nor will we be able to compare data from different populations where the medical images sit on different servers in different countries and are regulated by different laws and policies.

If we want to see the kinds of progress being made in other areas of image classification, we may need to reconsider our current policies around sharing patient data. Of course we’ll need consent from patients, as well as a means of ensuring data transfer across systems. This second point alone would be worth pursuing anyway, as it may lead to a set of (open) standards for interoperability between different EHR systems.

As with all things related to machine learning, having access to high fidelity, well-labelled data is key. If we don’t make patient data accessible in some format or another we may find it hard to use AI-based systems in healthcare. This obviously assumes that we want AI-based systems in healthcare in the first place.

Mixed methods research: John Creswell seminar


For me, mixed methods research (MMR) is about using qualitative and quantitative data to strengthen an argument that is difficult to support with only one type of data. It’s about bringing together the numbers (quantitative) and stories (qualitative) to gain a more complete understanding of the world (research problems and questions). We often think of those two approaches as being separate and distinct, but when combined they produce something greater than the sum of the two parts. Earlier this year we had the opportunity to attend a seminar by John Creswell & Tim Guetterman. Here are my notes.

Introduction to Mixed Methods Research

Practical uses of mixed methods research:

  • Explaining survey results
  • Exploring the use of new instruments in new situations
  • Confirming quantitative results with qualitative findings (why is it often the quantitative component that comes first; any situations where the quantitative could be used to explain the qualitative results?)
  • Adding qualitative data into experiments
  • Understanding community health research
  • Evaluating programme implementation

What are the major elements of MMR?

  • A methodology (popular way of conducting research)
  • Collecting and analysing quantitative and qualitative data
  • Integrating different sets of data
  • Framing the study within a set of procedures (called mixed methods designs)
  • Being conscious of a philosophical stance and theoretical orientation

Quantitative data collection (closed-ended) makes use of instruments, checklists and records. Quantitative data analysis uses numeric data for description, comparison, and relating variables.

Qualitative data collection (open-ended) uses interviews, observations, documents and audio-visual materials. Data analysis revolves around using text and image data for coding, theme development, and then relating themes.

What does “integration” mean? We can do this by merging (using one set of data with another), connecting (using one set of data to explain or build on another), or embedding (quan within qual, or qual within quan) the data.

What is MMR not?

  • Reporting quan and qual data separately (they should be combined)
  • Using informal methods (it is systematic)
  • Simply using the name (it must be rigorous)
  • Collecting either multiple sets of quan or qual data (i.e. not multimethod research; MMR must collect both quan and qual data)
  • Collecting qual data and then quantitatively analysing it (that is content analysis; MMR collects both forms of data)
  • Simply considering it an evaluation approach (it is a complete methodology)

Specific benefits of MMR:

  • Quan to qual: make quan results more understandable
  • Qual to quan: understand broader applicability of small-sample qual findings
  • Concurrent: robust description and interpretation of multiple sets of different data

Popular mixed method designs:

  • Basic: convergent (bringing qual and quan data together), explanatory sequential (using one set of data to explain the other more clearly), exploratory sequential (using qual findings to develop something quantitative, e.g. an instrument or intervention design)
  • Advanced: intervention, social justice, multistage evaluation

Research questions related to MMR

  • Convergent design: To what extent do the quan and qual results converge?
  • Explanatory design: In what ways do the qual data help to explain the quan results?
  • Exploratory design: In what ways do the quan results generalise the qual findings?

How do we display quan and qual results together (joint display)? There is lots of variation in how both sets of data can be presented. MAXQDA is an application that can be used to analyse and display different sets of data.

How do we publish MMR? Consider publishing the different sets of data in different papers.

How do we link writing structure to design? Writing about and publishing mixed methods research may require different approaches to article structure and style of writing.

The importance of qualitative research in mixed methods

Key features of qual research:

  • Following the scientific method
  • Listening to participant views
  • Asking open-ended questions
  • Building understanding based on participant views
  • Developing a complex understanding of the problem
  • Going to the setting to gather data
  • Being ethical
  • Analysing the data inductively, letting the findings emerge
  • Writing in a user-friendly way
  • Including rich quotes
  • Maintaining a researcher presence in the study (reflexivity)

Types of problems that qual research is suited to:

  • A need to explore a context
  • When it is important to listen
  • Unusual / different culture
  • Don’t know the questions to ask
  • Understanding a process
  • Need to tell a story

How do our backgrounds inform the way we interpret the world? There is an element of reflexivity and an understanding that data interpretation is dependent on our individual personal and professional contexts.

Writing a good qual purpose statement:

  • Single sentence, often in the form of “The purpose of this study…”
  • A focus on one central phenomenon
  • “Qualitative words” e.g. explore, describe, understand, develop
  • Includes participants and setting

Understanding a central phenomenon:

  • Quan: explaining or predicting variables
  • Qual: understanding a central phenomenon

Data collection:

  • Sampling (purposeful)
  • Site selection (gatekeepers, permissions)
  • Recruitment (incentives)
  • Types of data (observation, interview, public/private documents, audio-visual)

Interview procedures:

  • Create a protocol
  • 5-7 open-ended questions (the first question should be easy to answer, e.g. the participant’s role or experience; the last question could be “Who else should I speak to in order to get more information about this?”)
  • Allows the participant to create options for responding
  • Participants can voice their experiences and perspectives
  • Record and transcribe for analysis


Observation procedures:

  • Create an observation protocol
  • Descriptive notes (portrait of informant, setting, event) and reflective notes (personal reflections, insight, ideas, confusion, hunches, initial interpretation)
  • Decide on observational stance (e.g. outsider, participant, changing roles)
  • Enter site slowly
  • Conduct multiple observations
  • Summarise at the end of each observation

Types of audio-visual material:

  • Physical trace evidence
  • Videotape or film a social situation, individual or group
  • Examine website pages
  • Collect sounds
  • Collect email or social network messages
  • Examine favourite possessions or ritual objects

How to code data:

  • Read through the data (many pages of text)
  • Divide text into segments (many segments of text)
  • Label segments of information with codes (30-40 codes)
  • Reduce overlap and redundancy (reduce codes to 20)
  • Collapse codes into themes (reduce codes to 5-7 themes)

A good qual researcher can identify fine detail but also step back and see the larger themes.
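The reduce-and-collapse steps above can be sketched mechanically (the actual coding work is interpretive, and tools like MAXQDA manage exactly this kind of code-to-theme mapping); the codes and themes below are invented for illustration.

```python
from collections import Counter

# Invented example: codes applied to segments of interview transcripts.
coded_segments = [
    "time pressure", "workload", "staff shortages",
    "trust in colleagues", "team communication",
    "workload", "time pressure", "team communication",
]

# Collapsing overlapping codes into a smaller set of themes (the last two
# steps of the list above).
CODE_TO_THEME = {
    "time pressure": "Resource constraints",
    "workload": "Resource constraints",
    "staff shortages": "Resource constraints",
    "trust in colleagues": "Team relationships",
    "team communication": "Team relationships",
}

def themes(segments):
    """Count how many coded segments support each theme."""
    return Counter(CODE_TO_THEME[c] for c in segments)
```

The counts show which themes carry the most evidence, which is useful when deciding which 5-7 themes to write up.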

How to write a theme passage:

  • Use themes as headings
  • Use codes to build evidence for themes
  • Use quotes and sources of information to demonstrate themes

Writing up the qual study:

  • Description
  • 5-7 themes
  • Use codes and quotes to support themes
  • Tell a good story

Five approaches to qual research:

  • Narrative (comes out of literature)
  • Phenomenology (psychology)
  • Grounded theory (sociology)
  • Ethnography (anthropology)
  • Case study
  • Can also include discourse analysis, participatory approaches

Ethical issues:

  • Respect the site, develop trust, anticipate the extent of the disruption
  • Avoid deceiving participants, discuss purpose
  • Respect potential power imbalances
  • Consider incentive for participants

MAXApp is a mobile app for collecting data on Android and iOS devices:

  • Take photos
  • Write memos
  • Audio recording
  • Location data (geotagging)
  • How is this different to something like Evernote? If you’re already using MAXQDA, it offers integration with the desktop client. If you use another data analysis package, then MAXApp may not be as useful.

Relevant readings