Approval for Google search history analyzer

madprime · October 18, 2018, 4:51pm

I’m posting a review request for the Google search history analyzer, which is applying for project approval on Open Humans. This project is run by Athina Tzovara with support from Bastian Greshake Tzovaras.

Should this project be visible and available for all Open Humans members to join?

Please vote Approve or Deny, and/or comment.

Quick links

Activity page: https://www.openhumans.org/activity/google-search-history-analyzer/
Project review guide: Project Review Guide
Project guidelines: https://www.openhumans.org/community-guidelines/#project

Project info

Title: Google search history analyzer
Managed by: Athina Tzovara with support from Bastian Greshake Tzovaras
Description: Here you can upload your Google search history data. Once uploaded, you can use the Personal Data Notebooks to analyze your searches. For example, you can compute your top searches per month, or obtain graphs of the top co-occurring terms within your searches. The data that you request from Google will contain your search terms, along with the date and time when you did these searches, and the websites that you visited after each search. By default, your data are kept private, so that only you have access to your data and we will not use or access your data without your explicit consent. If you want, you may also choose to share your data with other projects or to make them public.
Project website: https://goo-searches-analyzer.herokuapp.com/
Connections: 3 members
Data received: None
Data added: Google take-out archive of search data
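
As a rough illustration of the "top searches per month" analysis mentioned in the description, here is a minimal sketch. It assumes the search history has already been parsed into (datetime, query) pairs; that shape and the function name `top_searches_per_month` are assumptions for illustration, not the actual Takeout format or the code used in the Personal Data Notebooks.

```python
from collections import Counter, defaultdict
from datetime import datetime


def top_searches_per_month(records, n=3):
    """Map 'YYYY-MM' to the n most frequent queries in that month.

    records: iterable of (datetime, query) pairs. This shape is an
    assumption, not the real format of a Google Takeout export.
    """
    by_month = defaultdict(Counter)
    for when, query in records:
        # Normalize case and whitespace so "Open Humans" == "open humans".
        by_month[when.strftime("%Y-%m")][query.strip().lower()] += 1
    return {
        month: counts.most_common(n)
        for month, counts in sorted(by_month.items())
    }
```

Calling it with a list of parsed records returns a dictionary keyed by month, each value being a list of (query, count) pairs.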

beau · October 18, 2018, 5:50pm

my thoughts on this one are that I am in favor of approval but I think it would benefit from a security review as well as additional language about what might appear in your Google search history–this is one that I would probably not share myself because my search history is probably 1. very identifying, 2. embarrassing (searches for medical symptoms, for example)… and I think potential participants would benefit from a reminder to think about what is in their history before sharing

annoyingsquid · October 18, 2018, 6:02pm

Thanks for the feedback! That’s a good idea, I’ll definitely add some more concrete examples of what might appear in someone’s Google search history. Do you have any other specific points in mind that we should add regarding the security review?

beau · October 18, 2018, 6:15pm

if the source is public I am happy to help review (assuming that it uses the Python/Django framework); if it’s not public I would suggest verifying that the security docs for the framework being used have been read and followed (for Django that would be this document and this deployment checklist)

gedankenstuecke · October 18, 2018, 7:48pm

The code is public here, as it’s just a standard heroku deployment of the oh_data_uploader.

beau · October 18, 2018, 10:11pm

ah, cool!

two suggestions:

SECRET_KEY = os.getenv('SECRET_KEY', 'whopsthereshouldbeone')

I would change this to:

from env_tools import env_to_bool, get_enforcement_context

# at top of file
require_env, enforce_required_envs = get_enforcement_context()

if ON_HEROKU:
    SECRET_KEY = require_env('SECRET_KEY')
else:
    SECRET_KEY = os.getenv('SECRET_KEY', 'whopsthereshouldbeone')

# Must come last to catch all missing environment variables
enforce_required_envs()

otherwise someone could make the mistake of deploying to Heroku with the default SECRET_KEY, which could be bad

secondly, i would change this:

DEBUG = False if os.getenv('DEBUG', '').lower() == 'false' else True

to:

DEBUG = env_to_bool('DEBUG') and not ON_HEROKU

(env_to_bool is also part of env_tools)

this will prevent turning DEBUG on in production, which is dangerous, especially if forgotten about!

I noticed that this project is deployed with DEBUG=true, which can be verified by visiting a URL that does not exist and seeing this message:

You're seeing this error because you have DEBUG = True in your Django settings file. Change that to False, and Django will display a standard 404 page.
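
For readers who want to see the two ideas above without the env_tools dependency, here is a standard-library sketch. The helper names `parse_bool_env` and `load_security_settings` are hypothetical, and detecting Heroku through the `DYNO` environment variable is an assumption; the actual settings file may define ON_HEROKU differently.

```python
import os


def parse_bool_env(name, default=False, environ=os.environ):
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    value = environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_security_settings(environ=os.environ):
    """Return (SECRET_KEY, DEBUG), failing loudly if deployed without a key."""
    # Assumption: Heroku dynos set DYNO; adjust to however ON_HEROKU is
    # really determined in the project's settings.
    on_heroku = "DYNO" in environ
    secret_key = environ.get("SECRET_KEY")
    if on_heroku and not secret_key:
        raise RuntimeError("SECRET_KEY must be set when running on Heroku")
    if not secret_key:
        # Only reachable off Heroku, i.e. in local development.
        secret_key = "insecure-local-development-key"
    # DEBUG defaults to off and can never be switched on in production.
    debug = parse_bool_env("DEBUG", environ=environ) and not on_heroku
    return secret_key, debug
```

The point of the design is that both failure modes (a default key, or DEBUG left on) become impossible in production rather than merely discouraged.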

gedankenstuecke · October 18, 2018, 11:01pm

Those are good points, made a PR for those changes!

madprime · October 19, 2018, 12:29am

I also liked Beau’s suggestions regarding the things that might appear in the search history, and I have some other thoughts. In general, I’m trying to increase clarity so that potential users understand things.

Or to put it differently, minimize the chance someone says: “I didn’t know X! I wouldn’t have done this if you told me X!”

Some thoughts on this…

more clarity on the front page: I’d like more information on the front page, not in an “about” page. The chances of someone looking there are low. I think a front page should hit highlights of info mandated by project guidelines (but not too wordy) & direct someone to an “about” page to read details. In particular, “what’s in the data”, “who is running it”, and “what it will do”.
language & template re-use isn’t great: Because the site was adapted from an existing repository, there’s language that’s copied and overlapping with other projects. It’s particularly a problem when language/format things overlap with projects run by Open Humans Foundation, but it’s also just generally not a good thing – a user may misunderstand that “standard language” implies some “standard operation”. These templates are great, but it would be much clearer if each project tried to have a unique layout and language.
(just ideas) data processing improvements: I’m interested in potential improvements to the code itself! I’m glad the source is shared! I don’t think approval should be delayed based on this, but I can imagine things that would make me more comfortable with using it, e.g. some data filtering/processing before storage.

To be fair, I think other projects may suffer from issues as well regarding clarity and language re-use. I’d like them all improved. Buuuuut because this is a particularly sensitive data source, I think it’s especially important to do a good job.

PS - oh, good catch on the DEBUG = True @beau! … I should check other projects for that

beau · October 19, 2018, 4:59pm

re: the dangers of DEBUG = True, I had a hunch that I might see the debug page that shows all of the environment variables if I uploaded a zero-byte file that ended in .zip. I tested this and the debug page did appear, so you might add some error handling around malformed file uploads too.
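
One way the error handling for malformed uploads could look is sketched below, using only Python's `zipfile` module. The function name `validate_zip_upload` and the message strings are hypothetical; this is not the code in oh_data_uploader, just an illustration of rejecting empty or invalid archives before any further processing.

```python
import io
import zipfile


def validate_zip_upload(data):
    """Return (ok, message) for uploaded bytes that claim to be a ZIP archive."""
    if not data:
        return False, "The uploaded file is empty."
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False, "The uploaded file is not a valid ZIP archive."
    try:
        with zipfile.ZipFile(buffer) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            bad_member = archive.testzip()
    except zipfile.BadZipFile:
        return False, "The ZIP archive could not be read."
    if bad_member is not None:
        return False, "The ZIP archive contains a corrupt file."
    return True, "OK"
```

A view could call this first and show the returned message to the user, so a zero-byte .zip yields a friendly error instead of an unhandled exception.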

Django does an admirable job of masking any environment variable containing “SECRET” or “PASSWORD”, so those are safe… but it does give an attacker the host, database name, and user name of your Django database, which they could then mount a brute force attack against (or use a Postgres exploit if a zero-day comes out). I did verify that the database seemed to be accessible from any IP address; it would be better if it was only accessible from inside Heroku.

gedankenstuecke · October 19, 2018, 11:24pm

Yep, I just pushed the latest master branch of oh_data_uploader to heroku for the project discussed in the thread. That should fix the security issues @beau mentioned.

ArloJamesBarnes · October 21, 2018, 9:22pm

Minor wording suggestion: one of the steps says “pick the delivery method and compression of your choice”, but later assumes ZIP. Perhaps use “compressed file (ZIP or TGZ)” instead.

annoyingsquid · November 6, 2018, 12:22am

Thanks @ArloJamesBarnes ! I have updated the wording for the delivery method, good point!

annoyingsquid · November 6, 2018, 12:33am

Thanks @madprime and @beau for the very helpful suggestions! Very good points, and I’m really glad that this review process is helping to improve the safety of our project! I think that @gedankenstuecke has already addressed the issues regarding the code of this project, and I have finally just found some time to work on the wording and web page.

In particular, I have now added more information on the front page, giving a more detailed overview of the collected data and the nature of this project. I have included examples of sensitive Google searches, to make sure that users are aware that their searches may contain private information.

I have also rewritten some parts of the ‘About’ page regarding the collected data that I think were taken from other projects. I have now formatted them in a way that I hope puts more emphasis on the Google search history data and only then, as a second step, mentions the data that are collected through Heroku or OH (IP addresses etc.).

Please let me know how these look to you and if you have any other ideas for further improvement!

Many thanks to everyone!

madprime · November 13, 2018, 9:10pm

Sorry I wasn’t able to get back to this sooner!

1. what’s in the data: This is now covered in a lot of detail on the front page. But it’s wordy; I think it could be further improved to be more clear and concise. Normally I wouldn’t push further on this issue, but given the sensitivity of the data (which @beau also noted) I think further improvement would be good. See below for a suggestion?

2. who is running it: This doesn’t seem to be on the front page?

3. what it will do: this seems to be fine, I like the note about the notebooks!

A bullet point list might help communicate information clearly and concisely? For example…

This data contains:

every Google search you’ve performed
the exact characters for each search
the date and time of each search
any links you followed as a result of each search

This data is sensitive because it’s likely to have things like…

exact names of people you know – which can be used to identify you
places you might have been (e.g. a search for “San Diego zoo hours”)
your shopping (e.g. a search for “LED lamp recommendations”)
your health (e.g. a search for “how to remove earwax”)

beau · November 13, 2018, 9:49pm

the security fixes look great, and I verified that DEBUG is now not set.

I also like Mad’s suggestion for a bulleted list!

annoyingsquid · November 13, 2018, 10:33pm

Thanks @madprime and @beau ! I have just added bulleted lists, good idea, and I’ve also added contact information on the front page.

Great that the security fixes are set!

annoyingsquid · November 13, 2018, 10:33pm

@madprime : What are the next steps now?

madprime · November 13, 2018, 10:38pm

Thanks! I’m in favor of approval now.

@beau sounds like the same?

And I’d like to wait a bit to see if anyone else weighs in, otherwise I’ll plan to wrap it up tomorrow afternoon as approved.

beau · November 13, 2018, 11:29pm

yes, I’m also in favor of approval

wolfgang8741 · November 14, 2018, 2:37am

A few more points to consider, as search history can contain far broader information. (Making this quick to get the response out for consideration, as I do not have a moment to browse the code in depth.)

First, expanding on madprime’s statements on explicit warnings of what people might disclose: I would go further and note accidentally entered credit card information, the privacy implications for one’s social network if users have searched for anyone else along with private information, e.g. email addresses (not just one’s own information may be at risk), and searches made by someone else while this user was logged in. The impacts of search are possibly broader than the individual.

Second is the question of what tools have been integrated to help users detect sensitive information before release. Social security number and phone number filters? The list could be extensive, but given that once data is released it is in the public, there should be some tools to help prevent the release of sensitive information, or at least a summary that hints at what is being released and a way to explore it. Most importantly, a summary should be displayed (saying what was detected, with a link to the code, as not all things will be thought of) along with a confirmation acknowledging that this is being made public.
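
To make the idea of a pre-release summary concrete, here is a minimal sketch of a pattern-based scan. The patterns, labels, and the function name `summarize_sensitive` are all hypothetical and deliberately incomplete (US-centric formats only, no checksum validation for card numbers); a real tool would need a much broader and better-tested list. It reports only counts per category, so the summary itself never repeats the sensitive strings.

```python
import re
from collections import Counter

# Hypothetical, incomplete patterns for illustration only.
PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US social security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "possible card number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "US phone number": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}


def summarize_sensitive(queries):
    """Count how many queries match each pattern, without returning matches."""
    counts = Counter()
    for query in queries:
        for label, pattern in PATTERNS.items():
            if pattern.search(query):
                counts[label] += 1
    return dict(counts)
```

The resulting counts could be shown on a confirmation page before data is made public, alongside a link to the patterns so users can see what was and was not checked.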