
Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...)
Closed, Resolved · Public

Description

It would be good to have somewhere central that tabular datasets could be uploaded to -- e.g. the data to be plotted by the Graphs extension.

The data should be uploadable and exportable in a variety of standard formats -- e.g. CSV, TSV, JSON, XML -- and stored in a format-neutral way.

Wikidata is great for single values, or values to be scattered across multiple items, but it's not so good for data in a systematic 1-dimensional or 2-dimensional form.

Jheald (talk) 13:24, 13 November 2015 (UTC)

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 4 support votes, and was ranked #88 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Commons#Allow_tabular_datasets_on_Commons_.28or_some_similar_central_repository.29

Which wiki to use for storage is discussed at https://meta.wikimedia.org/wiki/User:Yurik/Storing_data


Event Timeline


I think there's a very valuable use case for data sets well below the 100 MB range, particularly all manner of reference data: think of a list of countries, cities within countries, telephone area codes, the names of HTML entities, colour names/codes, et cetera, or hierarchies used for classification purposes such as those used in banking.

The reason I made this request to begin with is that I took the trouble to extract data that was already available in Wikipedia, but presented as an HTML table. I formatted it and then wanted to share it as a TSV, which is a very light format (but JSON would be awesome too), such that it could be consumed easily by anyone -- e.g. a script can fetch the data right from the source and consume it.

That helps developers save each other a great deal of time and effort.

I think there's a very valuable use case for data sets well below the 100 MB range

That was just a general example, because it wasn't clear to me whether this was being sold as a general-purpose solution for all data sets. 4 MB is the actual limit for wiki pages, and honestly, things get kind of sucky once you go above 1 MB.

As an aside, I'd note that if we're making up our own custom data formats, Lua already has a loadData mechanism (mw.loadData) for loading data sets formatted as Lua tables.
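
A minimal sketch of that existing pattern, assuming a hypothetical data-only module consumed via Scribunto's mw.loadData (module name and contents are made up for illustration):

-- Module:CountryData (hypothetical): a data-only module returning a plain Lua table.
return {
  { name = "France",  iso2 = "FR", callingCode = 33 },
  { name = "Germany", iso2 = "DE", callingCode = 49 },
}

-- In a separate consuming module, mw.loadData loads the table once per page
-- and hands back a read-only view of it.
local countries = mw.loadData('Module:CountryData')
for _, row in ipairs(countries) do
  mw.log(row.name .. ': +' .. row.callingCode)
end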

Lua tables

I know nothing of that technology, but if it can be parsed readily in any environment, that would be fine; otherwise, TSVs or JSON are very common formats.

@Bawolff Lua supports mw.text.jsonDecode(), which is great for these pages -- if we keep the TSV-style data in a schema like

{
  "columns": ...
  "data":[
    [1,2,3],
    [4,5,6],
    ...
  ]
}
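
A minimal Lua sketch of consuming a blob in that shape via mw.text.jsonDecode (the literal JSON string here is a made-up example, not real page content):

-- Hypothetical example blob following the schema sketched above.
local blob = '{"columns": ["a", "b", "c"], "data": [[1, 2, 3], [4, 5, 6]]}'
local t = mw.text.jsonDecode(blob)

-- Sum the second column across all rows.
local total = 0
for _, row in ipairs(t.data) do
  total = total + row[2]
end
-- total is now 6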

Change 281331 had a related patch set uploaded (by Yurik):
Implement tabular content support

https://gerrit.wikimedia.org/r/281331

The above patch allows tabular data storage with string, numeric, and "localized" column types. Localized is basically a language-code -> string map. Also, license and description metadata are allowed (and more can easily be added).

Change 281331 merged by jenkins-bot:
Implement tabular content support

https://gerrit.wikimedia.org/r/281331

matmarex removed a project: Patch-For-Review.

So I guess Yurik is working on making this happen? I'm not sure what the JsonConfig extension is or why this got merged into it.

Merged -- please take a look at https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_data_storage_for_Commons.21

Demo at http://data.wmflabs.org/wiki/Data:Sample.tabular?uselang=fr (try different languages)

@matmarex, JsonConfig is an extension I built a while ago for storing structured data on a wiki and making it available from another wiki. Basically it was a refactoring of my Zero code, to allow this functionality to be shared. It was already used for a data namespace for storing JSON (in a limited deployment), but I decided to make some more improvements before promoting it to a wider audience. JsonConfig is a misnomer -- it was initially thought to be for configurations rather than any data, even though the code is actually identical. We could of course rename it, but I don't think it is worth the hassle.

Couple quick notes:

  • pretty cool. :)
  • I worry about the efficiency of storage and queries; for small tables JSON blobs are fine, but for large data sets this'll get extremely verbose, and loading/saving small updates to a large table will get very slow. Consider providing query and update primitives that don't require the client (Lua module, wiki API client, editor form, etc.) to load and parse the entire blob, e.g. "give me the rows with key values in this range" or "update columns A, B, and C with these values in rows with key values X, Y, and Z". This could then be extended in future to a different storage backend, or a streaming JSON reader.

@brion, could you think of use cases for partial data reads? I think it will be mostly "draw data as a wiki list or a table with some magical highlighting/string concatenation/...", or "draw a graph with all the data". That's why at this point I simply provide Lua's mw.data.getData("page_name.tabular"), which returns the whole thing as JSON decoded into a Lua table structure. I see no reason not to provide mw.data.query("SELECT field1 FROM page_name WHERE field2 > 10"), but that's a whole other ball game :)

Also, adding JSON storage should provide an easy solution for lookups, e.g. mw.data.lookup("page_name.json", "top-level-field-value"), which could be very efficient for large but shallow JSON pages.
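
As a rough illustration of the current whole-blob model (the page name, column layout, and "rows" field here are assumptions; mw.data.getData is the experimental call described above):

-- Load the entire data page, then filter client-side in Lua.
-- This is the pattern that becomes slow once the table gets large,
-- which is what partial-read primitives would avoid.
local t = mw.data.getData('US primary results by county.tabular')

local subset = {}
for _, row in ipairs(t.rows) do
  -- hypothetical layout: column 2 holds the total vote count
  if row[2] > 10000 then
    table.insert(subset, row)
  end
end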

Pulling individual data items out of large lists; pulling relevant columns in order to sum them; pulling or updating a small number of cells during editing; subsetting a large data set to graph the subset; subsetting a large data set to perform operations on it in a Lua module.

For a concrete example: say I have a data set that lists the voting breakdown during US primary elections for every county. That's roughly 3000 rows, each with data points for each candidate. Not too awful perhaps, so let's make a bigger table.

Population of every US census place for every 10-year census since 1790. That's probably a lot. Now add more columns for various breakdown information.

Subset queries on a large data set would allow for reading in data relevant to a map area, or a particular place, or a particular breakdown column, or a particular year range, or some combination of all these things, while pulling in only the column list and doing some sort of index scan on the key rows.

The alternative is to break down the giant table into lots of small tables, and have the client code in a Lua module or whatever 'stitch' together the multiple data sets when needing to cross boundaries.
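
A hedged sketch of that stitching approach, with hypothetical page names and the mw.data.getData interface mentioned earlier:

-- Hypothetical: the census data is split into one data page per decade
-- and reassembled in a Lua module before querying or graphing.
local merged = {}
for _, year in ipairs({ 1990, 2000, 2010 }) do
  local part = mw.data.getData('US census places ' .. year .. '.tabular')
  for _, row in ipairs(part.rows) do
    table.insert(merged, row)
  end
end
-- "merged" now spans all three pages, so cross-boundary operations can
-- run over it directly (at the cost of loading everything into memory).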

You might say that no one should store that much data in one table, but I'm pretty sure people will push the limits of a system like this. :)

As I understand these are stored as regular MediaWiki pages now, so they have a maximum length of 2 MB. Even naive queries pulling the whole thing into memory would be fast enough at these scales. If we want to think about performance for large data, we should first think about overcoming the length limitation :)

As I understand these are stored as regular MediaWiki pages now, so they have a maximum length of 2 MB. Even naive queries pulling the whole thing into memory would be fast enough at these scales. If we want to think about performance for large data, we should first think about overcoming the length limitation :)

Ah, I forgot all about that. ;) That'll at least stop people from creating _super_ huge datasets for now... unless they break them into multiple files and create Lua modules to stitch them back together. :) That may be acceptable however, and lets people prototype the kinds of crazy things they want until we hate it so much we decide we have to support them better.

I tried generating some random tables with this PHP script: https://gist.github.com/brion/46469ac2df31a8eb0e179f50b1967d20

I find I can only successfully save a file of under a megabyte even though the max is 2 megs; a 2 meg file gets rejected complaining that it's over 4 megabytes... I notice the output that comes back to me in the edit window is pretty-printed, which means a lot of indentation and newlines bulking up the storage requirements. That might be on purpose to provide a better diff view?

Seems to run reasonably fast to render locally, though it also produces a giant HTML table for the page view that's about 3.5 megs, which should at least compress ok. Editing also means resubmitting the entire data set at once with the edit form.

A partial-edit submission system similar to section editing on wiki pages might be nice to reduce submission latency and bandwidth consumption editing a table, but that can probably wait until it needs to be paired with a spreadsheet-like UI.

Side note: headers are rejected if they contain spaces. That seems odd?

In T120452#2226245, @brion wrote:

Population of every US census place for every 10-year census since 1790. That's probably a lot. Now add more columns for various breakdown information.

Argh. :-) The 'right' (but more complex) solution for that kind of problem is not to try and hack a multi-dimensional dataset into a 2D representation, but to represent it faithfully. However, AIUI that's outside the scope of this work — which I think is a pity. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/ is the work I nominally sponsored on this back in the day. It's actually an RDF encoding of the (non-SemWeb) https://sdmx.org/ standard (XML-based ISO standard, widely used in the stats community). More concretely, Excel's "PivotTables" feature is essentially mapping normalised nD data from a 2D format back into an (interactive) series of 2.5D cuts of said data. Ish.

@brion, the pretty printing is for diffs - makes them much easier to digest. A while ago I also tried to make the actual storage use a non-pretty-printed version, but I guess I never finished that part.

@brion, re spaces in headers -- I want header names to be identifiers; after all, the target audience is Lua scripts and Graphs, both of which are programming environments. It makes things much easier to use, because I can simply say datum.buildingHeight instead of datum["building height"] in Vega or in Lua. @Eloy has made some suggestions on how to localize them, and I think we should go with the second option -- "headers.i18n": [ { "en":"blah", "fr": "blah", ...}, ...] -- for header localization.

@matmarex, exactly -- the page-size limitation restricts how these datasets are used. In a way, this is a much easier to implement and more flexible "multi-stream" proposal that will allow editors to create shared lists and tables. Giant dataset storage with a proper query interface is a separate task altogether. We want to have it, but it is a much bigger target with much bigger resource requirements.

P.S. For some reason I don't get email notifications from phab anymore :(

Re headers -- yeah, we need to distinguish between header labels (i18n-able text) and column IDs (identifiers for programs). As long as the capability is there I don't mind the terms used; sounds like you're already working on that :)

I'm totally ok to bikeshed about the naming:

  • for ID, it will be a list of strings named: "headers", "ids", "columns", "header_id", ...
  • for localized column name, it's a list of objects, each object having (language id -> string). We can call it "columns", "labels", "headers", ... ?

I'm totally ok to bikeshed about the naming:

  • for ID, it will be a list of strings named: "headers", "ids", "columns", "header_id", ...
  • for localized column name, it's a list of objects, each object having (language id -> string). We can call it "columns", "labels", "headers", ... ?

I think "column_id" and "columns" for consistency would make sense, but I don't care much. ;_)

@Yurik: The W3C CSV on the Web working group's metadata model recommendation refers to "columns" with attributes for "name" and "titles" (plural, allowing alternates or per-language variants), with similar recommended character restrictions on "name" for ease of programmatic use: https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dfn-column-description

So following that naming model would give us either separate "column_name" and "column_title" keys containing lists of strings / localized string map objects... or, following the W3C data model a little more closely, a "column" key containing a list of objects with their own "name" and "title"(s?) properties.
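
For illustration, that W3C-style column description might map onto this proposal roughly as follows (shown here as a Lua table; the column names and titles are made up):

-- One object per column: "name" restricted to identifier-safe characters
-- for programmatic use, "titles" holding per-language display text.
local columns = {
  {
    name = 'buildingHeight',
    titles = { en = 'Building height', fr = 'Hauteur du bâtiment' },
  },
  {
    name = 'yearBuilt',
    titles = { en = 'Year built', ru = 'Год постройки' },
  },
}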

@Jdforrester-WMF data cube sounds awesome. :D someday later!

In T120452#2231244, @brion wrote:

@Jdforrester-WMF data cube sounds awesome. :D someday later!

Yessir. :-)

More important semi-bikeshed questions:

  • How should we store licenses? Is there a license ID of any sort? I would rather not have a free-form license text field, if possible.
  • Are there any other metadata fields required?
  • Should we support datetime fields in the MediaWiki datetime format? (I would love to re-use all the Wikidata data types, but apparently it is not easy to refactor them out.)

And how do you request deletion?

One idea is to use a multipart ContentHandler. The Page namespace on Wikisource does this. That way, you can have a wikitext part and a JSON part. But it also potentially adds to confusion.

Can the talk pages be used for deletion requests?

On Commons, usually a subpage of "Commons:Deletion requests/Page under discussion" is used for a deletion request, with a tag being placed on the page being discussed. Same on English Wikipedia, save for different names.

@JEumerus, adding a tag to the structured content (JSON) is not as obvious as for free-form wiki markup. We could interpret some meta fields as wiki markup...
So far this is the structure I'm going for:

{
  "license": "licence-id",  // do we have these IDs in core somewhere?
  "info": {
    "lang-id": "description in that language",  // plain text, localized
    ...
  },
  "wikitext": "Any wiki markup, including [[category:my category]] and {{delete|me}}",  // better name?
  "headers": [ "header1", ... ], // non-localized
  "titles": [
    { "en": "header one", "ru": "первая колонка", ... },  // plain text, localized
    ...
  ],
  "types": [ "string", "localized", "number", "boolean" ],
  "rows": [ [...], [...], [...] ]
}

An extra feature could be "headersRef": "Some other table.tabular" instead of "headers" and "titles", allowing headers to be defined in another table. This way many identically structured tables can benefit from the shared localization.
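
A hedged sketch of how a consumer could resolve such a reference, assuming the experimental mw.data.getData call from earlier in this thread and hypothetical page names:

-- Fall back to the referenced table's headers/titles when the local
-- table only carries a "headersRef" (hypothetical resolution logic).
local function resolveHeaders(t)
  if t.headers then
    return t.headers, t.titles
  end
  local shared = mw.data.getData(t.headersRef)
  return shared.headers, shared.titles
end

local t = mw.data.getData('Population of Paris.tabular')
local headers, titles = resolveHeaders(t)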

@Yurik I like that -- maybe generalize it as a metadata inheritance model; anything not filled out in the local JSON is taken from the referenced .tabular item.

Side note -- the referenced source Data: page should maybe get recorded as a template link in the link tables? Or a file link at least. Some kind of reference. :)

Name game: "inheritFrom", "deriveFrom", "ref", "link", "metadata", ...?
Still an open question: where to get the license IDs and their descriptions :)

I'm a fan of "inheritMetadata" :)

I found MediaWiki:Licenses - I am not sure what it is used for. There are translations in the subpages. Also, there is a list in preferences that I don't know what code controls. I don't want to create a separate licensing system for this, so I'm looking for some ideas.

I found MediaWiki:Licenses - I am not sure what it is used for.

They're used for the licensing dropdown at Special:Upload (try loading the page with JS disabled, otherwise Commons-specific scripts will muddle it up).

Also, there is a list in preferences that I don't know what code controls.

This one is from UploadWizard (the whole preferences tab is specific to UW). It's controlled by $wgUploadWizardConfig['licensing'] (https://github.com/wikimedia/mediawiki-extensions-UploadWizard/blob/master/UploadWizard.config.php#L428).

I'm not sure if either of these is worth reusing. They were both created to match the already existing license templates at Commons. If you want machine-readable identifiers, try the SPDX ones (https://spdx.org/licenses/).

JsonConfig can easily allow us to store all allowed licenses as a "config" page - a JSON page with a custom schema for licenses. This way we could have a wiki page named Config:Licenses.json (bikeshedding is welcome):

{
  // License ID
  "CC0-1.0": {
    
    // Localized official name of the license. Do we need some 'short name'?
    "name": {
      "en": "Creative Commons Zero v1.0 Universal",
      ...
    },

    // Link to more information. Do we need multiple urls, e.g. FAQ vs Legal Code?
    "url": {
      "en": "http://creativecommons.org/publicdomain/zero/1.0/"
    },

    // https://commons.wikimedia.org/wiki/File:Cc-zero.svg
    "icon": "Cc-zero.svg",

    // https://commons.wikimedia.org/wiki/File:CC0_button.svg
    "button": "CC0_button.svg",

    // if has a value, this license will be shown at the top before all other licenses
    // The license with the lowest number will be shown as the default choice
    "prefered": 1
  },
  ...
}

Another alternative is to actually reuse .tabular for storing this data, instead of creating a custom format. Also, .tabular could benefit from an "ordered" meta tag to automatically re-sort values on save. No sorting would be allowed on fields of the "localized" data type.

"ordered": "id"
// or
"ordered": ["prefered", "id"]

I removed T124569 because with T134426 this task could be marked as done (unless we want to keep it open for non-tabular JSON/XML data).

@Yurik, would you be available to discuss this at the TechCom-RFC meeting on IRC this week? E213 has the details (date and time: 2016-06-15, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)).

Yurik renamed this task from "Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML)" to "Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...)". Jun 21 2016, 8:29 PM

@Yurik and all, I'm glad to see all this work going on. I was pointed to this after I made a comment on a Wikidata property proposal that I thought would be best addressed by somehow allowing a tabular data value rather than a single value. However, I'm wondering if this might be best driven by specific problem cases rather than trying to tackle generic "data" records. One of the most common needs is for time-series data: population of a city vs time, for instance, economic data by point in time, physical data like temperature vs time, etc. The simplest extension beyond the single value allowed by Wikidata would be to allow a set of pairs defined by two Wikidata properties (e.g. P585 "point in time", P1082 "population"). The relation to Wikidata takes care of localization (those properties have labels in many different languages) and defines the value types (time and quantity in this case), and the dataset would somehow be a statement attached to a Wikidata item (e.g. a particular city), so that the item and the pair of properties fully define the meaning of the collection of pairs. The underlying structure of the pairs doesn't really matter much. But there seems to be something missing here - I think it might be best addressed in Wikidata itself...

@ArthurPSmith thanks -- time series do represent a large portion of the common data needs. I have discussed reusing some of the Wikibase type system code with @daniel, but apparently their code is fairly Wikibase-centric, especially in the front end, and it might be difficult to reuse. I certainly hope that at some point we will be able to do that. Until then, I am hoping to launch a generic tabular service with a limited subset of types, and later move on to integrating more of Wikibase's work into it.

Hmm, if that goes live where will the repository be?

@JEumerus sorry, I did not see your response. The storage repository will be on Commons as well, accessible from all other wikis. See the community discussion here.

Enabled in https://gerrit.wikimedia.org/r/#/c/326046/ - T148745

I think we should create new tasks for anything that is left to do. Otherwise this task is not actionable - it's just a wish list of many things, which could just as well be individual actionable tasks under Commons-Datasets.