Overview
Reading wants a service for supporting reading lists (lists of wiki articles) on the web interface, to synchronize list data between devices and to provide functionality that requires a backend. This is a Q4 goal for Reading Infrastructure. The primary customers of the project are the iOS, Android and Reading Web teams.
Reading Lists will be a REST service that provides CRUD operations for lists and list items (page titles), the ability to sort lists in arbitrary ways, and support for efficiently syncing list changes. (At a later stage, support for searching the content of the listed pages and for push notifications is also planned. That is not included in this RfC.) Lists are only available to authenticated users and they are private data. These lists are NOT wiki-specific and might contain pages from any number of wikis.
The goal of this RfC is to decide on the MediaWiki/RESTBase issue and make sure there are no objections to the high-level implementation plan.
Infrastructure
(An alternative infrastructure proposal was removed from here after the decision was made, and is preserved in T164990#3350956.)
Consumers want a REST API, and adding one to MediaWiki would be a prohibitively large project. Writing an external service that can interface with authentication data directly would likewise be too complicated. Thus, the service will have to involve parts in both MediaWiki and RESTBase.
The service is implemented as action API module(s) in a MediaWiki extension. Summaries are fetched from RESTBase. Requests are proxied through a minimal RESTBase-based service to add REST semantics and versioning.
(Alternatively, fetching the summaries might be handled in the RESTBase part if that turns out to be significantly more performant.)
Pros:
- Can reuse existing MediaWiki logic and infrastructure (auth, DB handling, API helpers etc)
- Can take advantage of extra action API functionality if needed (batching, generators, OAuth etc)
- All the elements (creating a MediaWiki extension, writing action API modules, creating a RESTBase service for wrangling MediaWiki API calls) are well understood, achievable, and easy to estimate.
- Can be more easily reused by third parties (if they don’t need REST format)
Cons:
- Two codebases to maintain
API syntax
Opt in/out
Users must opt in explicitly before they can use the API, to spare storage space (of default lists) and bandwidth (of push notifications). Users can opt out again if they want to delete their data.
- POST /lists/reading/setup: enables reading lists for the user (creates default list, sets up push notifications). A prerequisite for everything else.
- POST /lists/reading/teardown: disables reading lists, deletes data
List CRUD
A list consists of an id, a name, a creation date, some metadata (description, image etc), and an array of entries.
- GET /lists/reading/: get all lists for the user.
- POST /lists/reading: creates a new list
- PUT /lists/reading/{list_id}: updates a list
- DELETE /lists/reading/{list_id}: deletes a list
List entry CRUD
An entry consists of an id, a project (wiki domain), a title and a creation date. The summary of the page (as given by the summary API) is also included.
- GET /lists/reading/{list_id}/entries: gets all entries.
- POST /lists/reading/{list_id}/entries: adds a new entry to the list
- DELETE /lists/reading/{list_id}/entries/{entry_id}: deletes an entry
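The list and entry shapes described above can be sketched as plain data classes. This is only an illustration: the class names, the ISO 8601 timestamp strings and the JSON key names are assumptions, not a decided response format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReadingListEntry:
    id: int
    project: str  # wiki domain, e.g. "en.wikipedia.org"
    title: str
    created: str  # assumed to be an ISO 8601 timestamp
    summary: dict = field(default_factory=dict)  # as given by the summary API


@dataclass
class ReadingList:
    id: int
    name: str
    created: str  # assumed to be an ISO 8601 timestamp
    description: str = ""  # one piece of the optional metadata
    entries: list = field(default_factory=list)


def to_json(reading_list):
    """Serialize a list and its nested entries to a JSON string."""
    return json.dumps(asdict(reading_list), sort_keys=True)
```

Since asdict() recurses into nested data classes, a list can be serialized together with its entries in one call.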
Sorting
Sorting works by getting/setting an array of all list / list entry ids. The number of lists per user / entries per list will be capped at some high but reasonable number so these arrays cannot grow too long.
- GET /lists/reading/list_order: get order of lists
- PUT /lists/reading/list_order: set order of lists
- GET /lists/reading/{list_id}/entry_order: get order of list entries
- PUT /lists/reading/{list_id}/entry_order: set order of list entries
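A client could apply such an order array roughly as follows. This is a minimal sketch: the function name and the "id" key are hypothetical, and placing ids that are missing from the order array at the end is an assumption, not something this RfC specifies.

```python
def apply_order(items, order):
    """Return items sorted by the position of their id in the order array.

    Items whose id does not appear in the order array are placed last,
    keeping their original relative order (assumed behavior).
    """
    position = {item_id: index for index, item_id in enumerate(order)}
    # sorted() is stable, so unknown ids keep their relative order.
    return sorted(items, key=lambda item: position.get(item["id"], len(order)))
```

For example, applying the order [3, 1, 2] to items with ids 1, 2, 3, 4 yields 3, 1, 2, 4.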
Misc
- GET /lists/reading/pages/{project}/{title}: gets the lists that the given page is a member of
- GET /lists/reading/changes/since/{date}: gets all lists and entries which changed since the given date (for device sync). Returns a marker for deleted lists/entries as well.
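On the device side, merging the result of the changes route into local state might look like the sketch below. The response shape is an assumption: each change is taken to be a dict with an "id" and an optional "deleted" flag standing in for the deletion marker.

```python
def apply_changes(local, changes):
    """Merge server-side changes into a local {id: item} mapping.

    Changes flagged as deleted remove the local copy; all other changes
    replace or add the item. The input mapping is not modified.
    """
    merged = dict(local)
    for change in changes:
        if change.get("deleted"):
            merged.pop(change["id"], None)  # tolerate items never seen locally
        else:
            merged[change["id"]] = change
    return merged
```

The same routine could serve both lists and entries, as long as ids are unique within each mapping.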
Action API syntax
(in case option 1 is chosen above)
- list=readinglists to get all lists of the current user
- rlprop=entries to enable/disable returning the entries together with the list
- rllimit, rlentrylimit
- list=readinglists&rllist={list_id} to get the entries of a single list
- list=readinglists&rlproject={project}&rltitle={title} to get the lists containing a given page
- list=readinglistchanges&rlcsince={date} to get changed/deleted entries/lists
- action=readinglists&command=[setup|teardown|create|update|delete|createentry|deleteentry|order] for all the write operations on lists. (This is slightly against action API conventions which would put all "commands" as separate actions, which IMO would make the help interfaces unnecessarily hard to use.)
- action=readinglistentries&command=[create|delete|order] for all the write operations on list entries.
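To illustrate how the proxy layer could translate the read-only REST routes into the action API parameters listed above, here is a rough sketch. The function name is hypothetical and the real translation would live in the RESTBase service, not in Python. It assumes the list modules are queried through action=query and that titles arrive percent-encoded, so they contain no literal slashes.

```python
from urllib.parse import urlencode


def read_route_to_params(path):
    """Map a GET route under /lists/reading to action API query parameters."""
    parts = path.strip("/").split("/")
    if parts[:2] != ["lists", "reading"]:
        raise ValueError("not a reading lists route: " + path)
    rest = parts[2:]
    params = {"action": "query", "format": "json"}
    if not rest:
        # GET /lists/reading/
        params["list"] = "readinglists"
    elif len(rest) == 2 and rest[1] == "entries":
        # GET /lists/reading/{list_id}/entries
        params.update(list="readinglists", rllist=rest[0])
    elif len(rest) == 3 and rest[0] == "pages":
        # GET /lists/reading/pages/{project}/{title}
        params.update(list="readinglists", rlproject=rest[1], rltitle=rest[2])
    elif len(rest) == 3 and rest[:2] == ["changes", "since"]:
        # GET /lists/reading/changes/since/{date}
        params.update(list="readinglistchanges", rlcsince=rest[2])
    else:
        raise ValueError("unsupported route: " + path)
    return params
```

The resulting dict can be passed to urlencode() to build the query string for the MediaWiki API request.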
Data storage
Reading lists contain primary data (cannot be regenerated from other sources and losing it would have a major UX impact), and data needs to be fetched based on criteria other than the id (e.g. all lists containing a given page, all entries which have changed after a given date) so MariaDB will be used:
reading_list table:
- rl_id
- rl_user_id: central ID of the user
- rl_is_default: flag to tell apart the initial list from the rest, for UX purposes and to forbid deleting it
- rl_name: human-readable name, non-unique
- rl_description
- various other metadata: color, image, icon (TBD: should these be in a separate key-value table to make adding/removing metadata types easier?)
- rl_date_created
- rl_date_updated
- rl_deleted (we need soft-delete for sync)
- indexes: (rl_user_id, rl_date_updated) for the /since/ route, (rl_user_id, rl_deleted) for getting all
reading_list_entry table:
- rle_id
- rle_list_id
- rle_user_id: central ID of the user (denormalized for the benefit of the /pages/ route)
- rle_project: wiki project domain (TBD: use a lookup table / some other way of compression?)
- rle_title: page title (can't easily use page ids due to the cross-wiki nature of the project. Also, page ids don't age well when content is deleted/moved.)
- rle_date_created
- rle_date_updated
- rle_deleted
- indexes: (rle_list_id, rle_date_updated) for the /since/ route and all entries in a list, (rle_user_id, rle_project, rle_title) for the /pages/ route
reading_list_sort_index table:
- rlsi_rl_id
- rlsi_index
- indexes: (rlsi_index)
Deleted lists / entries will be purged by a periodic job when they are older than X days (so that other devices have time to sync the deletion).
TBD: are all the indexes worth it? We don't expect a single user to have lots of lists / a single list lots of items
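For illustration, the two main tables and their indexes could be prototyped as below. This is a sketch only: it uses SQLite as a stand-in for MariaDB, the column types and index names are assumptions, the user column of the entry table is named rle_user_id as in the index definition, and the extra metadata columns and the sort index table are left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE reading_list (
    rl_id INTEGER PRIMARY KEY,
    rl_user_id INTEGER NOT NULL,
    rl_is_default INTEGER NOT NULL DEFAULT 0,
    rl_name TEXT NOT NULL,
    rl_description TEXT,
    rl_date_created TEXT NOT NULL,
    rl_date_updated TEXT NOT NULL,
    rl_deleted INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX rl_user_updated ON reading_list (rl_user_id, rl_date_updated);
CREATE INDEX rl_user_deleted ON reading_list (rl_user_id, rl_deleted);

CREATE TABLE reading_list_entry (
    rle_id INTEGER PRIMARY KEY,
    rle_list_id INTEGER NOT NULL,
    rle_user_id INTEGER NOT NULL,
    rle_project TEXT NOT NULL,
    rle_title TEXT NOT NULL,
    rle_date_created TEXT NOT NULL,
    rle_date_updated TEXT NOT NULL,
    rle_deleted INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX rle_list_updated ON reading_list_entry (rle_list_id, rle_date_updated);
CREATE INDEX rle_user_project_title
    ON reading_list_entry (rle_user_id, rle_project, rle_title);
"""


def create_db():
    """Create an in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def lists_containing(conn, user_id, project, title):
    """Query behind the /pages/ route: ids of the user's lists holding a page."""
    rows = conn.execute(
        "SELECT rle_list_id FROM reading_list_entry "
        "WHERE rle_user_id = ? AND rle_project = ? AND rle_title = ? "
        "AND rle_deleted = 0 ORDER BY rle_list_id",
        (user_id, project, title),
    )
    return [row[0] for row in rows]
```

The lookup filters on exactly the columns of the (rle_user_id, rle_project, rle_title) index and skips soft-deleted rows.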
Schedule and cross-dependencies
Q4: set up an MVP on the beta cluster.
Dependencies: Services and Tech Ops for consulting, Security for review.
Q1: integrate with apps, deploy to production. Add download sizes (will be handled by the page summary API).
Dependencies: Android/iOS for testing and iterating, Services for download sizes.
Q2: improve app integration, add search.
Dependencies: Discovery (Backend) for figuring out search (possibly on multiple wikis, in a limited set of pages). Android/iOS for continued iteration.
Usage projections
Android has a similar feature (except you don't need to be logged in), used by 10% of users. That results in 1500 list write operations per hour (the peak is around 2500). There are 6M Android users a month, and <1M iOS users. There are about 300K active logged-in web users a month on enwiki (counting user_touched - not very reliable since we have extended login durations to a year), and the churn for that group is nearly 100%. Enwiki tends to be about half of everything.
Put together, that suggests somewhere around 0.1 writes/sec (maybe add an order of magnitude to that in case throngs of readers register to use the new feature - this will be the first time we heavily promote registration to readers).
Syncing is done via a push model, so read volume should be below write volume.
TBD - apps store lists locally and will only use the API occasionally, for syncing. Can web do that? Otherwise, we might have way more reads.
The average number of pages per user is ~25 on Android, so assuming the usage level is the same across all device types and the login rate does not grow exponentially, we can expect ~2.5M rows to be actively used at one time. Assuming 100% user churn and 200 bytes per row, that's 6GB of storage space every year. (Keep in mind this is a very handwavy guesstimate as we have no way to know what fraction of users will actually use lists / how many readers will register an account just because of this.)
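The storage figure can be reproduced with a few lines of arithmetic. One assumption is needed that the text does not state explicitly: the 100% churn is read as monthly, in line with the monthly user counts above, so the active row set is replaced twelve times a year.

```python
ACTIVE_ROWS = 2_500_000  # rows in active use at one time
BYTES_PER_ROW = 200
TURNOVERS_PER_YEAR = 12  # assumption: 100% churn per month


def yearly_storage_gb(rows, bytes_per_row, turnovers_per_year):
    """Storage added per year, in decimal gigabytes, if no rows are purged."""
    return rows * bytes_per_row * turnovers_per_year / 1e9


print(yearly_storage_gb(ACTIVE_ROWS, BYTES_PER_ROW, TURNOVERS_PER_YEAR))
```

Under these assumptions a single generation of active rows takes about 0.5 GB, which adds up to the 6 GB per year quoted above.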
All of these estimates refer to Q4 2017-18, which is when the feature is enabled on the web if all goes according to plan. Before then, it will only be used by apps which will probably result in significantly smaller usage.
See also
For previous conversations and far more detail see the Technical Plan, T164805, T164808, T164236.
The (very different) former proposal for reading lists can be found at T128602: RFC: Backend for synchronized data from Wikipedia mobile apps.