Best practices for ongoing "API to file data"?

I recently noticed the OH-maintained Fitbit data import isn’t querying the current sleep API endpoint (it seems to be using a now undocumented/deprecated endpoint) – I wrote that up here: https://github.com/OpenHumans/oh-fitbit-integration/issues/26

This raises some general questions about the challenges of recording data, especially accumulating it over time, in the face of changing APIs.

Challenges with API data

  1. A new API endpoint exists with new/different data.
  2. An old endpoint is no longer available.
  3. Or, an old endpoint works but is now undocumented/deprecated. (And it might cease.)
  4. A discontinued endpoint provided data not available elsewhere.
  5. API data isn’t available for the indefinite past, e.g. only data from the last 30 days is available.
  6. API data has ongoing collection.
  7. Re-querying past data may be expensive to do.

Questions

  1. When to document some sense of “version”? How?
  2. Does a new version remove representation of discontinued, undocumented, or deprecated data?
  3. If it is possible to rewrite past data into a new version, should that be done?

Specific case: Fitbit

With the Fitbit data, I think challenge #1 is true (there’s new data). I think #3 is true as well, and I’m not sure whether #4 is. #6 and #7 are true (data is collected in an ongoing manner, appending to files, and it’s “expensive” to collect), but #5 is not true (all past data remains available).

Qs and potential best practices

Questions we face, and some reflection on potential “best practices”…

Q1: version?

I don’t think we had a version. The format itself might be an issue? All data is in a single JSON, fitbit-data.json, which means you can’t have historic files with one version and new ones in another.

Does this mean it’s generally “best practice” to split files up for historic data (e.g. one file per month), rather than blob everything into a single file? That would let past files keep an older version while files for data going forward use the new one.
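To make the "one file per month" idea concrete, here is a minimal sketch. It assumes each record carries an ISO date string under a hypothetical "date" key, and the "fitbit-sleep" filename prefix is made up for illustration; the current integration does not work this way.

```python
from collections import defaultdict


def split_by_month(records, date_key="date"):
    """Group records into {"YYYY-MM": [records, ...]}.

    Assumes record[date_key] is an ISO date string such as "2019-10-03".
    """
    by_month = defaultdict(list)
    for record in records:
        month = record[date_key][:7]  # "2019-10-03" -> "2019-10"
        by_month[month].append(record)
    return dict(by_month)


def monthly_filenames(by_month, prefix="fitbit-sleep"):
    """Map a hypothetical per-month filename to that month's records."""
    return {
        f"{prefix}-{month}.json": records
        for month, records in sorted(by_month.items())
    }
```

With files split like this, a month that is already complete never has to be rewritten, so it can keep whatever version it was written with.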

And: open question about how to represent version. (Metadata on Open Humans? In the file itself somehow? Note that JSON has no comment syntax, so a version stored in the file has to be part of the data proper…)
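One possible way to keep the version inside the file, sketched under assumptions: wrap the payload in an envelope with a hypothetical "format_version" key, and treat files without it as unversioned. The key name and the envelope shape are illustrative, not something the integration does today.

```python
import json


def wrap_with_version(data, version):
    """Return an envelope that carries the version alongside the data."""
    return {"format_version": version, "data": data}


def read_version(text, default="unversioned"):
    """Read the version from a JSON document, if it has one.

    Older files written without an envelope fall back to `default`.
    """
    doc = json.loads(text)
    if isinstance(doc, dict):
        return doc.get("format_version", default)
    return default
```

The trade-off is that every reader has to know about the envelope, whereas metadata stored on Open Humans would leave the file contents untouched.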

Q2: remove past data representation in new version?

Generally I’d say: yes? I’m not a fan of keeping empty fields around when we can’t fill them. Let old data exist with old versions and new data exist with new versions.

Q3: rewrite past data into new version?

I think that would be ideal

BUT

It’s a lot of work to do this stuff – ugh. Nobody pays us for this so… no. We shouldn’t. The thought experiment was fun while it lasted.

EASIER

I write this last section knowing how the operations are working under the hood, knowing how much work the speculation above would actually be.

While it would be lovely to do versions and split files, I’m much more interested in minimizing our labor.

Tagging @mcescalante for his own thoughts. Here’s a lossless approach I think might be a lot easier.

  1. Add a new JSON key to fitbit_urls that stores all the relevant data from the new endpoint
  2. Fill it in with data into the past.
  3. Leave the old ones in place, and let them keep updating if they can. (Don’t care if it’s redundant.)

I think this is “good enough” as I expect it’s lossless, gets the new data, and allows cleanup at some later date (which I don’t expect to happen, but at least it hasn’t been rendered impossible).
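The three steps above could look roughly like this. It is only a sketch: the key names ("sleep-old", "sleep-new"), the field names, and the date-keyed layout are assumptions for illustration, not the actual structure of fitbit-data.json.

```python
import copy


def add_endpoint_data(existing, new_key, new_records):
    """Add data from a new endpoint under its own key, leaving old keys alone.

    `existing` is the already-loaded file contents (a dict).
    `new_records` is assumed to be a dict keyed by date string, so
    re-running a backfill overwrites the same dates instead of duplicating.
    """
    merged = copy.deepcopy(existing)  # never mutate the caller's data
    bucket = merged.setdefault(new_key, {})
    bucket.update(new_records)
    return merged


# Hypothetical usage: old data stays put, new data lands beside it.
old_file = {"sleep-old": {"2019-10-01": {"minutes": 400}}}
new_file = add_endpoint_data(
    old_file, "sleep-new", {"2019-10-01": {"minutesAsleep": 400}}
)
```

Because nothing under the old keys is touched, a later cleanup can still decide what to drop.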

I think date-canonical versioning is fine. If you fix the Fitbit API code in October, fitbit_2019.10 is a very reasonable way to keep track of when the code was last touched. If it’s a truly minor update to the code, use a suffix: fitbit_2019.10a or fitbit_2019.10_1.
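For what it’s worth, generating tags in that style is tiny. This sketch covers only the numeric-suffix form (fitbit_2019.10_1); the function name and arguments are made up for illustration.

```python
from datetime import date


def date_version(source, when, minor=0):
    """Build a tag like "fitbit_2019.10", or "fitbit_2019.10_1" for a minor fix."""
    tag = f"{source}_{when.year}.{when.month:02d}"
    return f"{tag}_{minor}" if minor else tag
```

Zero-padding the month keeps tags from the same year sorting in chronological order as plain strings.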