
[Bug] Wikibase API allows adding impossible dates such as "31 September 2014"
Open, Low, Public

Description

I've noticed there are bad dates for some items in Wikidata - e.g. https://www.wikidata.org/wiki/Q5906
and https://www.wikidata.org/wiki/Q5816 both feature dates with September 31. The data should be sanitized by the code to not allow such things to exist.

Notes from Addshore:

Test edit confirming the issue at https://www.wikidata.org/w/index.php?title=Q4115189&diff=187799522&oldid=187799194

Easy test request in the API Sandbox to make the edit (Just add a token)

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description.
Smalyshev subscribed.

I just checked both of the linked items (and the versions that existed when this bug was created) and cannot see the date issue or recreate it, so this can probably be closed.

Is it possible that this has fixed itself, or was it some very random bug?
Possibly it was fixed during the switch from JS to backend formatting for everything?
It could have remained visible for a while due to some odd caching?

AFAIK the data would never be formatted as "September 31" (in PHP) unless the data was actually Month September, Year 31.

@Addshore - I still see September 31 in https://www.wikidata.org/wiki/Q5906 as a value of "head of government" for item "Q124363", qualifier "end time". Is this something cached, or am I not looking in the right place?

Same with https://www.wikidata.org/wiki/Q5816, property "country of citizenship", value Republic of China, end time - 31 September 1949

Interesting...

For "head of government" for item "Q124363" qualifier "end time" I see "31 September 2005"
For https://www.wikidata.org/wiki/Q5816, property "country of citizenship", value Republic of China, end time I see "31 September 1949"

@Smalyshev what language are you viewing in? And also are you logged in or out?

I am logged in, language is English.

Okay! This now sounds very odd! :D
Any chance you could take and upload a screenshot?
Also just to confirm the language code is 'en' and not one of the en variations?
When logged out does the date display in the same way?

If I log out, the "31 September" still shows.

Ah, I was misunderstanding the issue; I thought that literally only "31 September" was showing.
The DataValue in question is as follows and should indeed not be allowed.

"value": {
    "time": "+00000001949-09-31T00:00:00Z",
    "timezone": 0,
    "before": 0,
    "after": 0,
    "precision": 11,
    "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
}
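
A quick way to confirm that the value quoted above is not a real Gregorian date is to pull the month and day out of the time string and compare them with the month length. This is only a minimal sketch in Python, not Wikibase code: the helper name is_real_gregorian_date is made up for illustration, and it assumes the time string always has the "+YYYYYYYYYYY-MM-DDThh:mm:ssZ" shape shown in the value.

```python
import calendar
import re

# Assumed shape of the time string, based on the single value quoted in the thread.
TIME_RE = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z$")


def is_real_gregorian_date(time_string: str) -> bool:
    """Return True if the month/day pair exists in the (proleptic) Gregorian calendar.

    Hypothetical helper for illustration; only positive years are handled.
    """
    match = TIME_RE.match(time_string)
    if match is None:
        raise ValueError(f"unrecognised time string: {time_string!r}")
    sign, year, month, day = match.groups()
    year, month, day = int(year), int(month), int(day)
    if sign == "-" or year == 0:
        raise ValueError("this sketch only handles positive years")
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the first day, number of days in the month)
    return 1 <= day <= calendar.monthrange(year, month)[1]


print(is_real_gregorian_date("+00000001949-09-31T00:00:00Z"))  # False
print(is_real_gregorian_date("+00000001949-10-01T00:00:00Z"))  # True
```

Values stored with a coarser precision (month or day set to 0) would also come back as False here, so a real check would need to look at "precision" first.
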
Addshore triaged this task as Medium priority. Jan 13 2015, 7:46 AM

Yes, that's what I meant, sorry my explanation was a bit confusing.

Okay, so when "31 September 1949" is entered through the frontend and saved, it is formatted as "1 October 1949" and correctly saved as "+00000001949-10-01T00:00:00Z".
This suggests that the date in question was added through the API and that extra validation is needed there...

However, looking further at the item, I see this edit https://www.wikidata.org/w/index.php?title=Q5906&diff=118880783&oldid=118880605 in which the 31 September is added. Presuming the frontend was fixed after this edit (which was in April), I think it is time to check the API validation!

Addshore renamed this task from Bad dates (like September 31) in some entries to Wikibase API allows adding impossible dates such as "31 September 2014". Jan 13 2015, 1:23 PM
Addshore updated the task description.

I suggest not implementing code that checks for the described type of validity in the Wikibase extension. This is something external tools (bots and such) should do.

Reasoning:

  • Something like a "31 February" may exist in other calendar models.
  • Different calendar models need different validation. Currently we do not do this. Julian and Gregorian use the same basic validity checks (e.g. the day must be in the range [0,31]).
  • It may make sense to store such a date, for example if a source states a person was born on "31 February". We want to store that, even if it's wrong.

Okay, so when "31 September 1949" is entered through the frontend and saved, it is formatted as "1 October 1949" and correctly saved as "+00000001949-10-01T00:00:00Z".
This suggests that the date in question was added through the API and that extra validation is needed there...

However, looking further at the item, I see this edit https://www.wikidata.org/w/index.php?title=Q5906&diff=118880783&oldid=118880605 in which the 31 September is added. Presuming the frontend was fixed after this edit (which was in April), I think it is time to check the API validation!

I find it alarming that 31 September was corrected to 1 October. It is alarming because it has not been made clear in this thread what code is making the correction, and what rules are being followed. It may be safe enough to correct 31 September to 1 October, but what if the user enters 30 February 1900? Should that be corrected to 2 March 1900 or 1 March 1900? It depends on whether the calendar model is Julian or Gregorian. Is the code making the correction taking the calendar model into account? If the code is not cognizant of the Julian calendar, will it prevent the correct entry of 29 February 1900 Julian?
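
To make the question concrete, here is a minimal sketch (Python, not the code Wikibase actually runs) of a day-overflow correction that does take the calendar model into account. The function names and the "julian"/"gregorian" strings are invented for illustration, and only positive years are handled.

```python
from typing import Tuple


def is_leap_year(year: int, model: str) -> bool:
    """Leap-year rule per calendar model (positive years only in this sketch)."""
    if model == "julian":
        return year % 4 == 0
    if model == "gregorian":
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    raise ValueError(f"unknown calendar model: {model!r}")


def days_in_month(year: int, month: int, model: str) -> int:
    """Number of days in the given month (1-12) under the given model."""
    february = 29 if is_leap_year(year, model) else 28
    lengths = [31, february, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    return lengths[month - 1]


def roll_over(year: int, month: int, day: int, model: str) -> Tuple[int, int, int]:
    """Carry surplus days into the following month(s), as an overflowing parser would."""
    while day > days_in_month(year, month, model):
        day -= days_in_month(year, month, model)
        month += 1
        if month > 12:
            month = 1
            year += 1
    return year, month, day


print(roll_over(1900, 2, 30, "gregorian"))  # (1900, 3, 2)
print(roll_over(1900, 2, 30, "julian"))     # (1900, 3, 1)
print(roll_over(1900, 2, 29, "julian"))     # (1900, 2, 29), left alone
print(roll_over(1900, 2, 29, "gregorian"))  # (1900, 3, 1)
```

A correction that only knows the Gregorian rule would turn the valid Julian date 29 February 1900 into 1 March 1900, which is exactly the concern raised above.
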

I find it alarming that 31 September was corrected to 1 October.

That's how JavaScript and PHP work:

> new Date( 2015, 8, 30 ).toDateString()
'Wed Sep 30 2015'
> new Date( 2015, 8, 31 ).toDateString()
'Thu Oct 01 2015'
php > echo var_export( date( 'D M d Y', strtotime( '2015-09-30' ) ) ) . "\n";
'Wed Sep 30 2015'
php > echo var_export( date( 'D M d Y', strtotime( '2015-09-31' ) ) ) . "\n";
'Thu Oct 01 2015'

Python throws an exception instead:

>>> from datetime import datetime
>>> datetime(2015, 9, 30).strftime('%a %b %d %Y')
'Wed Sep 30 2015'
>>> datetime(2015, 9, 31).strftime('%a %b %d %Y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: day is out of range for month

Even if we allow any string as a date, we still have to decide what to do with such data on export. Most external tools cannot work with or even represent invalid dates. Also, searching etc. needs some sanity in dates to be able to compare them. If we allow dates like September 31, they would fall outside any search or matching algorithm.

I tested JavaScript:

document.write(new Date( 1900, 1, 29 ).toDateString())

Which gave as a result

Thu Mar 01 1900

So this indicates that the JavaScript Date object is the wrong tool for the job we are doing. I haven't tested PHP, but that is suspect too. So the code that is doing the "correction" needs to be identified and tested. If it can't accept valid Julian dates then either a different language that can accept Julian dates needs to be used, or a new object needs to be written that can handle the data we need it to handle.

I find it alarming that 31 September was corrected to 1 October. [...] Is the code making the correction taking the calendar model into account?

Currently not. We are using PHP's DateTime parsing as a fallback when parsing date strings, in addition to our own parsers. You can see this in action: Typing "+2015-02-31T00:00:00Z" triggers our own parser and results in "31 February", "2015-02-31" triggers PHP's parser and results in "3 March".
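
The difference between the two code paths can be imitated in a few lines of Python. This is a sketch of the observed behaviour only, under the assumption that the input is a plain "YYYY-MM-DD" string; it is neither Wikibase's parser nor PHP's algorithm, and both function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def parse_as_entered(text: str) -> Tuple[int, int, int]:
    """Split 'YYYY-MM-DD' into numbers without any calendar arithmetic."""
    year, month, day = (int(part) for part in text.split("-"))
    return year, month, day


def parse_with_overflow(text: str) -> date:
    """Imitate an overflowing parser: surplus days spill into the next month."""
    year, month, day = parse_as_entered(text)
    return date(year, month, 1) + timedelta(days=day - 1)


print(parse_as_entered("2015-02-31"))     # (2015, 2, 31)
print(parse_with_overflow("2015-02-31"))  # 2015-03-03
```

The first function keeps "31 February" exactly as typed; the second silently produces 3 March, which is why the same value can look different depending on which parser handled it.
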

I do not find the calculations PHP does helpful and would love to turn them off, but unfortunately there is no way to do that (other than implementing our own parser and not using PHP's). I'm currently trying to add code that prevents such calculations and stores dates like "2015-02-31" as entered.

The fact that "2015-02-31" can be stored via the API but is magically "corrected" when editing that value via the UI (that's a broken roundtrip) is a bug and must be fixed.

Related patches:

All patches in the above comment are now merged

All patches listed above are unrelated to this issue. To be more precise, they belong to the closely related issue of PHP's parser magically "correcting" 31 September to 1 October. I listed the patches as an answer for @Jc3s5h.

This issue needs discussion: Do we want to accept values such as "31 September 2014"? Or should such values be rejected? And if we decide so, is this the job of the parser (which one?) or the validator? And what exactly should be validated? Leap years? Leap seconds? Against which calendar model?

Personally I suggest keeping this as simple as possible. Enhanced validation should be done by external tools, as argued in my comment T85296#1125065 above. What we could do is a very simple, very basic check: Instead of the current [0,31] range for all months we could have individual ranges for each month. But the range for February will be [0,29]. We will not implement leap year algorithms in the current architecture of datamodel, parsers and validators.
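
As a rough sketch of what that proposal amounts to (in Python rather than the PHP the validators are written in), the current check and the per-month check could look like this. The names are hypothetical, and the month range [0,12] with 0 meaning "not specified" is an assumption made here by analogy with the [0,31] day range.

```python
# Highest allowed day per month under the proposal; index 0 stands for
# "month not specified" (an assumption of this sketch). February is capped
# at 29 in every year: no leap-year logic, as proposed.
MAX_DAY = [31, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def current_check(month: int, day: int) -> bool:
    """Current behaviour as described: one [0,31] day range for all months."""
    return 0 <= month <= 12 and 0 <= day <= 31


def proposed_check(month: int, day: int) -> bool:
    """Proposed behaviour: an individual [0,max] day range for each month."""
    return 0 <= month <= 12 and 0 <= day <= MAX_DAY[month]


print(current_check(9, 31))   # True: "31 September" passes today
print(proposed_check(9, 31))  # False
print(proposed_check(2, 29))  # True in any year
print(proposed_check(2, 30))  # False
```

This rejects "31 September" and "30 February" without involving the year or the calendar model; "29 February 1900" would still be accepted for both Julian and Gregorian.
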

I would agree with @thiemowmde if there were a plausible alternative to the Gregorian and Julian calendars which might require support in the future. But I believe that any such alternative would be so drastically different that there would be no temptation to apply the proposed generic support (e.g. any February may have up to 29 days). Thus, if we were willing to do the work to provide different validation for Gregorian and Julian, there would be minimal risk of having to create a large number of variations on the theme.

About thiemowmde's comment above "It may make sense to store such a date [31 February], for example if a source states a person was born on "31 February". We want to store that, even if it's wrong." It seems to me the purpose of a database is to store the best available conclusion, in a calendar that is, or can be converted to, the Gregorian calendar. Discussion of the merits of various pieces of evidence is fine for history journals, or even a Wikipedia article, but doesn't seem appropriate for Wikidata in its current state. If that sort of thing ever belongs in Wikidata, I think it should be part of an enhanced reference system that provides an opportunity to explain how the text that appears in the source was transformed into the data stored as a Wikidata property.

In cases such as the birth date of Augustus, where a specific day is named but it can't be transformed into the Gregorian calendar, I think the birth date should be given as a Julian or Gregorian date with looser precision.

@Jc3s5h wrote:

It seems to me the purpose of a database is to store the best available conclusion, in a calendar that is, or can be converted to, the Gregorian calendar. Discussion of the merits of various pieces of evidence is fine for history journals, or even a Wikipedia article, but doesn't seem appropriate for Wikidata in its current state.

Wikidata is not a database, it is a knowledge base. A database system is designed to answer queries, so it needs to be able to compare values. For that reason, you want normalized values in a database. For the Wikidata query service, we (plan to) do exactly what you say: we convert to Gregorian, using our best guess. But for the primary data that is maintained on Wikidata, we want to represent the knowledge as given in the original source we quote - including (up to a point) the inaccuracies, oddities, and mistakes. This is also why we allow contradicting values, and values that are known to be wrong (deprecated).

In reply to daniel's comment at Aug. 1-, 16:14, it seems to me it would be more common to find, in a source, a date given in an ancient calendar that can't be precisely converted to Gregorian, than it would be to find a date that purports to be Julian or Gregorian but is not a valid date (e.g. 31 September). Perhaps whatever method we develop to record dates in unsupported calendars should also be applied to obviously impossible Gregorian and Julian dates. Then statements using the Gregorian and Julian calendars would have to be valid dates.

@Jc3s5h: You mean, we could have a calendar model for "Broken Gregorian"? I like that idea!

It seems clear to me that we shouldn't just allow any input for Gregorian/Julian dates: the "123rd of Juli" or the "15th of Kittens" should not be accepted. However, I don't think the date parser should be responsible for checking for leap years when encountering February 29. Perhaps we can have a validator for checking that. Similarly, the parser should probably accept April 31, since it really only deals with syntax. I have no strong feelings about whether we should have a validator that would reject April 31. That's mostly a product decision. From the technical perspective, we cannot rely on dates getting cleanly normalized to ISO anyway - that should rarely fail for Gregorian dates, but we can never totally rely on it.

(As to parsers vs validators: the parser turns an input string into a TimeValue object, the validator checks whether a given TimeValue object is acceptable. Parsers do not apply to API input. Validators are applied to API input, but not to old values found in the database).
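
The split between the two roles can be sketched as follows. This is an illustration in Python with a deliberately simplified stand-in for TimeValue; the class fields, function names, the "D Month YYYY" input format and the specific ranges checked are assumptions of this sketch, not the actual Wikibase classes.

```python
import re
from dataclasses import dataclass
from typing import List

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


@dataclass(frozen=True)
class TimeValue:
    """Simplified stand-in for a Wikibase TimeValue (hypothetical fields)."""
    year: int
    month: int
    day: int


def parse(text: str) -> TimeValue:
    """Parser: syntax only. Turns 'D Month YYYY' into a TimeValue, no calendar checks."""
    match = re.fullmatch(r"(\d{1,2}) (\w+) (\d+)", text.strip())
    if match is None or match.group(2) not in MONTHS:
        raise ValueError(f"cannot parse {text!r}")
    day, month_name, year = match.groups()
    return TimeValue(int(year), MONTHS.index(month_name) + 1, int(day))


def validate(value: TimeValue) -> List[str]:
    """Validator: policy. Reports problems with an already-built TimeValue."""
    problems = []
    if not 0 <= value.month <= 12:
        problems.append("month out of range")
    if not 0 <= value.day <= 31:
        problems.append("day out of range")
    return problems


april_31 = parse("31 April 2015")         # accepted: syntactically fine
print(validate(april_31))                 # [] under the basic range check
print(validate(TimeValue(2015, 2, 32)))   # ['day out of range']
```

In this arrangement "15 Kittens 2015" fails in the parser, while whether "31 April" is acceptable is decided entirely inside validate(), which is the only step that a value arriving through the API would pass through.
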

In T85296#1524522, @daniel wrote in part:

@Jc3s5h: You mean, we could have a calendar model for "Broken Gregorian"? I like that idea!

My concern with a "Broken Gregorian" calendar model is the well-demonstrated willingness of editors to add data using whatever models are supported, whether it's true or not. (For example, all the birth dates and death dates for people not born near the Prime Meridian are false because the time zone is set to zero.) A Broken Gregorian model should only be applied to dates which the source purports to be Gregorian, but will probably be misapplied to correctly stated dates in calendars such as the Swedish calendar. Perhaps there could be a concept of a quoted date, with an ability to state the date, the source, and an entity or character string to give the calendar.

I think obviously wrong dates should not be allowed, even if a source says so, unless we get a special calendar model for that. Claiming that somebody was born on the 32nd of July is meaningless. Wikidata was supposed to be a *data* repository, and "32nd July in Gregorian Calendar" is not meaningful data. In Wikipedia, which is free-form narrative text, it's fine, but Wikidata I think should be machine-readable, which means the data actually has to make some sense. Of course, you can insert junk data, but it would be useless to 99% of use cases since you cannot do anything with it except display it as is. Moreover, every time you write code to handle it (such as displaying "X years old" in an infobox) you'd have to make a special case for broken data. I think that would be doing the users a disservice. If we claim the date is Gregorian or Julian, it should be a valid Gregorian or Julian date.

I find it exhausting to explain the same things again and again, to the same people.

We never said "Wikidata is a repository of data that make sense". You can store "foo" as an IMDb identifier. You can store "1 January 3100" as a date of birth. You can store "cucumber" as the gender of a person. You will be able to store "-99,000 feet" as the height of a mountain. In all these cases we already do some validation. You can not store empty strings as identifiers. You can not store "foo" as a date or quantity. Same here. We do some validation, but we can not make sure everything makes sense.

We call these things "claims" and "statements" for a reason. Not "facts".

"31 September 2014" is machine readable.

As I said, I would love to have a discussion (like in "sharing actual arguments") about my proposal above: Implement individual ranges for each month. This would solve this ticket.

@Jc3s5h wrote:
"Perhaps there could be a concept of a quoted date, with an ability to state the date, the source, and an entity or character string to give the calendar."

That is exactly what Wikidata's TimeValues are. No more and no less.

@Smalyshev We can decide to be more or less strict for calendar models we know well, but there will always be the possibility of "bad" dates, since for some calendar models, we don't even know how to validate. They may not even be strictly defined. So, conceptually, we can have all kinds and shapes of dates, some of which we can convert to canonical ISO. Others we can't. I don't see any problem there. Whether we allow or reject February 32 is a product-level decision with no impact on the data model or data mapping.

In T85296#1526755, @daniel wrote in part:

That is exactly what Wikidata's TimeValues are. No more and no less.

Nearly true. But the user interface only supports Julian or Gregorian; I don't know if the API would prevent the addition of an entity other than the URIs for these calendars. A character string cannot be used for the calendar, but one could create an entity and then refer to the entity.

Also, I regard the documentation to be part of the TimeValue. The documentation says you can only use Julian and Gregorian.

@daniel for "some" calendar models, sure. But for Gregorian and Julian, we know which dates are good and which are bad. So if you say "Gregorian date on February 32" it is meaningless since there's no Gregorian date of February 32. It's not a date in any meaningful sense.

You can not store "foo" as a date or quantity.

Storing "Gregorian date of February 32" is the same as storing "foo". You can make the same argument that some source may define date of some event as "foo", or even "on gregorian date of 'foo'". There's nothing to prevent this from occurring.

we can not make sure everything makes sense.

True, but nobody talks about "everything". We can certainly make sure Gregorian and Julian dates make sense, because we do know which Gregorian and Julian dates exist.

"31 September 2014" is machine readable.

"foo" is machine readable in the same sense - it is encoded in bytes, so a machine can read it. But by this definition, any encoded information is machine readable and there should be no validation whatsoever once we have some bytes. I don't think you would agree with such an approach, @thiemowmde. If we reject "foo" because it does not make any sense as a date, we must reject "31 September 2014 as Gregorian date", because it makes no sense either - you cannot use it as a date, it does not have a place on the date line. If we say "31 September 2014 as some broken date that may make sense in some calendar" - then it may be ok.

thiemowmde lowered the priority of this task from Medium to Low. Aug 12 2015, 10:22 AM

I'm sorry, but I do not see how more rants about how the world should be (but is not) are helpful in finding an acceptable compromise that improves the current situation. What's your proposed way forward?

Storing "Gregorian date of February 32" is the same as storing "foo".

No, it's obviously not. The first example contains two machine readable numbers: month number 2 and day number 32. The second example contains nothing. And it's not even a fair example because we never allowed day 32.

we do know which Gregorian and Julian dates exist.

I think we do not (https://en.wikipedia.org/wiki/February_30), but even if we do, how does this change anything? Why should it be forbidden to store e.g. the birth date of a fictional comic book character that was born on February 31? I suggest to read http://blog.wikimedia.de/2013/02/22/restricting-the-world/ again.

In T85296#1531136, @thiemowmde wrote in part:

we do know which Gregorian and Julian dates exist.

I think we do not (https://en.wikipedia.org/wiki/February_30), but even if we do, how does this change anything? Why should it be forbidden to store e.g. the birth date of a fictional comic book character that was born on February 31? I suggest to read http://blog.wikimedia.de/2013/02/22/restricting-the-world/ again.

Above there is a brief discussion of the parser vs. the validator. If I understand correctly, the validator is only active when the data is stored, and is irrelevant for data currently in Wikidata. The parser is obviously needed to parse input (one for the API and a different set for the user interface?) I expect parsing is also needed when doing queries and outputting results, either in JSON or the user interface.

What is not clear is whether the validator or parser(s) consider the calendar model. If not, and if there is no willingness to change them so they do, then the solution seems to be to go with a month-by-month requirement on the allowed date ranges, with [1...29] for February.

February 30 is a real date in the Swedish calendar. If I'm not allowed to enter https://www.wikidata.org/wiki/Q1130275 as the calendar, then it is an invalid date. It is a fictional date in The Lord of The Rings. If I'm not allowed to enter https://www.wikidata.org/wiki/Q15228 as the calendar model, then February 30 is an invalid date.

I see this as a set of changes: the parsers, the validators, the user interface, and the documentation. If there is no willingness to change the whole set, then we should take the view that only valid Gregorian or Julian dates are allowed. If the parsers and validator allow February 29 in every year, that is a programming expedient, and it is nevertheless an error for an editor to introduce such a date.

Jonas renamed this task from Wikibase API allows adding impossible dates such as "31 September 2014" to [Bug] Wikibase API allows adding impossible dates such as "31 September 2014". Nov 2 2015, 5:01 PM