
Decide on internal representation of (Gregorian and Julian) dates with negative years
Open, High, Public

Description

When storing Gregorian and Julian dates, we currently normalize them to an ISO-style timestamp format, something like +2015-05-05T00:00:00Z. However, it is unclear how years BCE should be represented. We currently represent 44 BCE as -0044 internally (the year 0 is accepted as input, but has no clear meaning); this is in conflict with ISO 8601, which states that +0 should be used to represent 1 BCE, and -0043 should be used to represent 44 BCE.

We should decide whether we want to continue to use "traditional" year numbering internally, or switch to "astronomical" (ISO) year numbering.
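As a minimal sketch of the difference between the two numbering systems, the conversion can be written as below. The function names are hypothetical and are not part of Wikibase; the only assumption is the rule stated in the description (astronomical year 0 is 1 BCE, so 44 BCE is -0043).

```python
def traditional_to_astronomical(year: int) -> int:
    """Traditional numbering has no year 0; -44 means 44 BCE."""
    if year == 0:
        raise ValueError("traditional numbering has no year 0")
    # Positive years are identical; BCE years shift by one.
    return year if year > 0 else year + 1


def astronomical_to_traditional(year: int) -> int:
    """Astronomical numbering: 0 is 1 BCE, -43 is 44 BCE."""
    return year if year > 0 else year - 1


def format_year(year: int) -> str:
    """Render a signed, zero-padded year as used in the timestamps (hypothetical helper)."""
    sign = "-" if year < 0 else "+"
    return f"{sign}{abs(year):04d}"


# 44 BCE in both systems
print(format_year(-44))                               # -0044 (traditional)
print(format_year(traditional_to_astronomical(-44)))  # -0043 (astronomical)
```

The sketch also shows why stored values are ambiguous: -0043 is a valid year in both systems, so nothing in the string itself says which conversion, if any, still needs to be applied.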

Either way, a lot of the data we currently have internally may actually be incorrect, depending on how it was provided:

  • If a date was entered as "March 15 44 BC", it would currently be represented as -0044-03-15T00:00:00Z. This is consistent with the traditional numbering system.
  • If a date was given in ISO format as "-0043-03-15", it would currently be stored internally unchanged as -0043-03-15T00:00:00Z. This would be wrong if we assume the internal timestamp uses traditional numbering, but the input used ISO format.

By looking at the database, we cannot know which timestamps were created from ISO input. Also, when looking at ISO style input, we do not know which numbering system is meant.

We should:

  • finalize the decision on the internal form, and clearly document it.
  • alert users who enter negative year numbers (as opposed to using AD/BC notation) in the UI.
  • revise all dates with negative years currently in the database.
NOTE: ISO 8601 actually *changed* from representing 44BC as -44 to now using -43 to represent that year. And then, later, XSD followed that change. That's the root of the confusion.

Event Timeline

daniel raised the priority of this task to High.
daniel updated the task description. (Show Details)
daniel added a project: Wikidata.
daniel added subscribers: Conny, Rical, Liuxinyu970226 and 6 others.

In Module:Author, I have chosen to compute negative centuries and lifetimes (around year 0) in ISO style and to display them in "traditional" years and Roman centuries. Centuries and their boundaries are therefore also affected by this choice.

Linked or out of scope here?
Years and centuries can be displayed in two ways, (Before Christ / After Christ) or (BCE / CE for Common Era), so I would like a format parameter that lets one choose (ISO/trad) and (J / CE) at the level of Wikidata, the wiki, or the user.

To get things moving, I'll make a proposal.

  1. The following fields from the internal representation (as shown in JSON) are essential to correctly interpret a "time" datatype: "time", "timezone", "precision", and "calendarmodel"
  2. The first change to the meaning of the existing fields is the "time" field; year 0 is allowed and is equal to 1 BCE.
  3. The second change to the meaning of the existing fields is "timezone"; it is a character string which may be a character representation of an integer or "nil". If it is a character representation of an integer, it has the same meaning as at present. If it is "nil" the time is local time, the meaning of which is to be inferred from the context.
  4. Currently the Z on the end of the "time" field is meaningless, and this lack of meaning continues.
  5. The "timezone" field, except as mentioned above, retains its existing meaning: Signed integer. Timezone information as an offset from UTC in minutes. For dates before the modern implementation of UTC in 1972, this is the offset of the time zone from universal time. Before the implementation of time zones, this is the longitude of the place of the event, expressed in the range −180° to 180° (positive is east of Greenwich), multiplied by 4 to convert to minutes.
  6. The "precision" field is the only method of expressing precision; precision shall not be inferred from what information is present or absent in the "time" field.
  7. The "precision" field identifies which characters in the "time" field must be valid date or time representations. For example "time":"+0014-08-19T00:00:00Z" combined with "precision":11 (days) is valid, but "time":"+0014-00-00T00:00:00Z" combined with "precision":11 (days) is invalid because 0 is not a valid month and 0 is not a valid day.
  8. The "time" value may be truncated at any point right of the rightmost day digit, provided the truncation does not leave only a single digit for hours, minutes, or seconds. This allows it to be read by a greater variety of existing ISO 8601 parsers. If hours, minutes, or seconds are required by the "precision" field but are not present in the "time" field they shall be deemed to be 0.
  9. The "before" field is reserved for future development.
  10. The "after" field is reserved for future development.
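Points 6 and 7 can be illustrated with a small validity check. This is only a sketch under stated assumptions: the function name and regular expression are hypothetical, the precision numbers (10 = month, 11 = day, 12 = hour, 13 = minute, 14 = second) are taken from the Wikibase data model documentation, and the day check uses a fixed per-month maximum because leap-year rules differ between the Julian and Gregorian calendars.

```python
import re

# Full-length "time" value: sign, at least four year digits, then fixed fields.
TIME_RE = re.compile(r"^([+-])(\d{4,})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$")

# Upper bound per month; February is allowed 29 days without a leap-year check.
MAX_DAY = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def is_valid_for_precision(time: str, precision: int) -> bool:
    """Check only the fields that the given precision declares significant."""
    m = TIME_RE.match(time)
    if m is None:
        return False
    month, day = int(m.group(3)), int(m.group(4))
    hour, minute, second = int(m.group(5)), int(m.group(6)), int(m.group(7))
    if precision >= 10 and not 1 <= month <= 12:
        return False
    if precision >= 11 and not 1 <= day <= MAX_DAY[month - 1]:
        return False
    if precision >= 12 and hour > 23:
        return False
    if precision >= 13 and minute > 59:
        return False
    if precision >= 14 and second > 59:  # leap seconds are ignored in this sketch
        return False
    return True


print(is_valid_for_precision("+0014-08-19T00:00:00Z", 11))  # True
print(is_valid_for_precision("+0014-00-00T00:00:00Z", 11))  # False
print(is_valid_for_precision("+0014-00-00T00:00:00Z", 9))   # True
```

Under this reading, "00" for month or day is tolerated as long as the precision is year or coarser, which matches the two examples given in point 7.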

I'm not sure I understand points 9 and 10: the "before" and "after" fields are reserved for future development.
Is this to distinguish before = coordinates and after = time zone?
Here I speak only about the time zone.

Like you, to get things moving, I'll make some proposals, from the ISO style format -0044-03-15T00:00:00Z.

Just after the "T", the time zone could range from -180 through 0 to 180.
The numeric precision (1 ... 11) is limited to days.
I propose a more flexible precision query by character:
Y=year, M=month, D=day, W=week, h=hour, m=minute, s=second.
Also (see below) C=century or E=era.

calendarmodel options for requesting a time string:

  • I or ISO => like -0044-03-15T00:00:00Z

The ISO time representation is always accessible by modules.

  • J or Julian => Julian
  • G or Gregorian => Gregorian
  • Y or year => year only in distant past, like for the precision parameter, see below.
  • 12 or 24 for AM/PM or 24h cycle.

calendarmodel - From this point, module writers can handle other calendar models for each use.
But this implies a lot of work for many module writers, a risk of errors, and hazardous interpretations. A library or more calendarmodel options could be better.

  • C => numeric century only, at any epoch.
  • R => Roman century only, at any epoch.
  • CE => Common Era format.
  • JC => Christian format.
  • india => Indian standard epochs.
  • china => Chinese standard epochs.
  • e or europ => European standard epochs.
  • outside any epoch defined for a region, we use R (Roman century).

Other countries use the defined European epochs because their histories are not yet sufficiently studied and standardised, but that could change for some countries.

calendarmodel options rules:

  • They all must have a different identifier.
  • We can mix them in a single parameter.
  • The default is nil or I for ISO.
  • The default options of europ calendarmodel are G(Gregorian) and R(roman century).

In the parameter named "custom" we define any client format, as in Excel or OpenOffice.
Examples: "D MMM YY" for "2 nov 14" or "DD/MM hh:mm" for "05/12 15:30".

The ISO style format is limited to the range -9999 to +9999.
The oldest limit of a historic epoch I have seen was -1901, in India.
But progress in archaeology, links with astronomical events, C14 dating, and recurrent biological events make it possible to consider years from the distant past.
To manage cases like 50,000 years or 20 million years, we could offer simple formats for years before -9999 or beyond +9999.

In reply to Rical's comment of Sun, May 24, 08:41: the current documentation is at https://www.mediawiki.org/wiki/Wikibase/DataModel and https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON

I understand that the before and after fields were intended to express, when a date is uncertain, how many precision units before or after the given date the real date could be. The precision units are:

precision: shortint. The numbers have the following meaning: 0 - billion years, 1 - hundred million years, ..., 6 - millennia, 7 - century, 8 - decade, 9 - year, 10 - month, 11 - day, 12 - hour, 13 - minute, 14 - second.
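A minimal sketch of how the precision codes and the before/after fields could combine for year-or-coarser precision. The function names are hypothetical, and the values for codes 2 to 5 are an inference from the listed endpoints (each step is a factor of ten), not something stated in the text above.

```python
def years_per_unit(precision: int) -> int:
    """Length in years of one precision unit, for precision codes 0..9.

    0 -> 1_000_000_000 (billion years) ... 6 -> 1000, 7 -> 100, 8 -> 10, 9 -> 1.
    Codes 2..5 are assumed to follow the same power-of-ten pattern.
    """
    if not 0 <= precision <= 9:
        raise ValueError("only year-or-coarser precision codes (0..9) are handled")
    return 10 ** (9 - precision)


def year_bounds(year: int, precision: int, before: int = 0, after: int = 0):
    """Earliest and latest year implied by before/after, counted in precision units."""
    unit = years_per_unit(precision)
    return year - before * unit, year + after * unit


# "Year 1200, up to one century later" with precision 7 (century)
print(year_bounds(1200, 7, before=0, after=1))  # (1200, 1300)
```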

The calendar model presently is a URL that leads to a Wikibase item about the calendar. This seems adequate and I don't see a need to invent a letter code. The Wikibase items are multi-lingual, which would be hard to achieve with a single letter.

Currently the user interface converts multiple ways of writing dates into ISO 8601 format. There's no need to store many different equivalent formats in the database. The calendars should probably be limited to different calendars, where the data contributors may not know how to transform from one calendar to another, rather than different formats for the same calendar.

Other threads have expressed a need to convert all calendars to and from the Gregorian (possibly proleptic) calendar. So there should be no support for calendars where no one knows how to do the conversion (for example, the pre-Julian Roman Calendar).

Before the implementation of time zones, this is the longitude of the place of the event, expressed in the range −180° to 180° (positive is east of Greenwich), multiplied by 4 to convert to minutes.

I think representing two semantics in one value is not a good idea. One data value should have only one semantics - either this is always minutes, or this is a longitude, but not both. Otherwise, handling such fields becomes very complex, and more complex code breeds errors and inconsistencies.

The "precision" field identifies which characters in the "time" field must be valid date or time representations

I would rather keep all characters valid, even for year-only dates. The reason is that otherwise it is very hard to compare "year 1200" to "12 April 1976" - since the former can not be converted to any point in time. Admittedly, "year 1200" describes not one point in time but a set of them, but for most applications (like "give me all painters born in 13th century in Europe") it is good enough. Not being able to use standard time handling tools to process such data is worse.

The "time" value may be truncated at any point right of the rightmost day digit

The more options the format has, the harder it is to parse correctly and the more opportunities there are for bugs in various code and tools. I'd rather keep it simple and always have the full time.

Before the implementation of time zones, this is the longitude of the place of the event, expressed in the range −180° to 180° (positive is east of Greenwich), multiplied by 4 to convert to minutes.

I think representing two semantics in one value is not a good idea. One data value should have only one semantics - either this is always minutes, or this is a longitude, but not both. Otherwise, handling such fields becomes very complex, and more complex code breeds errors and inconsistencies.

If we allow only the existing format and also allow only time zones, time zone offsets will be disallowed before 1880, which is when the first time zones were established in the British Isles. This would prevent accurately representing statements about events before the establishment of time zones if the date is known with certainty according to local time, but there is not enough information to state what the date would be in Universal Time.

The "precision" field identifies which characters in the "time" field must be valid date or time representations

I would rather keep all characters valid, even for year-only dates. The reason is that otherwise it is very hard to compare "year 1200" to "12 April 1976" - since the former can not be converted to any point in time. Admittedly, "year 1200" describes not one point in time but a set of them, but for most applications (like "give me all painters born in 13th century in Europe") it is good enough. Not being able to use standard time handling tools to process such data is worse.

The existing entries contain a great many examples of invalid characters such as "00" for the month or date. So requiring that all digits be valid would require having a bot run through the whole database and fixing all the invalid entries.

The "time" value may be truncated at any point right of the rightmost day digit

The more options the format has, the harder it is to parse correctly and the more opportunities there are for bugs in various code and tools. I'd rather keep it simple and always have the full time.

The converse risk is that some tool will only look at the time value rather than all the fields, and falsely conclude the date and time of Abraham Lincoln's birth is +1809-02-12T05:42:57 plus or minus one second.

The existing entries contain a great many examples of invalid characters such as "00" for the month or date.

These need to be fixed. It is an easily automatable task.

the date and time of Abraham Lincoln's birth is +1809-02-12T05:42:57 plus or minus one second.

Why would anybody do this, given that we have precision? And where would 05:42:57 come from, and why would we allow it into the database?

If we require that times contain all the possible digits, up to and including seconds, then we have to adopt some convention about what to put if the time is not known to the second. Since Wikidata currently claims the date was February 12, 1809, in Hodgenville, Kentucky, which has a longitude of 85°44'19"W, we might decide to first convert the time to +1809-02-12T00:00:00 local time, and then convert the local time of 85°44'19"W to the Universal Time. If the precision is given as one day but the place of the event is not 0° longitude, how else would the conversion be done?
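The arithmetic behind this example can be sketched as follows, using the rule from the proposal (longitude in degrees, positive east, multiplied by 4 to get minutes). The function names are hypothetical, and the sketch assumes local mean time rather than apparent solar time.

```python
from datetime import datetime, timedelta


def dms_to_degrees(degrees: int, minutes: int, seconds: float, west: bool = False) -> float:
    """Convert degrees/minutes/seconds to signed decimal degrees (east positive)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if west else value


def longitude_to_offset_minutes(longitude_deg: float) -> float:
    """Offset from Universal Time in minutes: 360 degrees = 1440 minutes."""
    return longitude_deg * 4.0


def local_mean_to_universal(local: datetime, longitude_deg: float) -> datetime:
    """Universal Time = local mean time minus the longitude offset."""
    return local - timedelta(minutes=longitude_to_offset_minutes(longitude_deg))


lon = dms_to_degrees(85, 44, 19, west=True)   # Hodgenville, Kentucky
ut = local_mean_to_universal(datetime(1809, 2, 12, 0, 0, 0), lon)
print(ut.replace(microsecond=0))              # 1809-02-12 05:42:57
```

Local midnight at 85°44'19"W is about 343 minutes behind Universal Time, which is where the value +1809-02-12T05:42:57 quoted earlier in the thread comes from.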

If we require that times contain all the possible digits, up to and including seconds, then we have to adopt some convention about what to put if the time is not known to the second

Of course. The natural choice would be to put 0 for time, and 1 for month/day.
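One possible reading of this convention, as a sketch only: every field finer than the stated precision is overwritten with its default (01 for month and day, 00 for hours, minutes, and seconds). The function name is hypothetical, and the precision codes (9 = year, 10 = month, 11 = day, and so on) follow the Wikibase data model documentation.

```python
import re

TIME_RE = re.compile(r"^([+-]\d{4,})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$")


def fill_defaults(time: str, precision: int) -> str:
    """Replace fields below the stated precision with 01 (month/day) or 00 (time)."""
    m = TIME_RE.match(time)
    if m is None:
        raise ValueError(f"unrecognized time value: {time!r}")
    year, month, day, hour, minute, second = m.groups()
    if precision < 10:
        month = "01"
    if precision < 11:
        day = "01"
    if precision < 12:
        hour = "00"
    if precision < 13:
        minute = "00"
    if precision < 14:
        second = "00"
    return f"{year}-{month}-{day}T{hour}:{minute}:{second}Z"


print(fill_defaults("+1200-00-00T00:00:00Z", 9))   # +1200-01-01T00:00:00Z
print(fill_defaults("+1976-04-12T00:00:00Z", 11))  # +1976-04-12T00:00:00Z
```

With this normalization every stored value parses as a real point in time, so "year 1200" and "12 April 1976" can be compared with standard tools, while the precision field still records that only the year is meaningful.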

we might decide to first convert the time to +1809-02-12T00:00:00 local time, and then convert the local time of 85°44'19"W to the Universal Time

Why would we decide that? That would be a rather strange decision given that we don't have time accuracy there. I would rather leave dates alone and not try to convert anything.

To distinguish time zones we can keep the character "T", and for a geographic position, replace "T" by "G".
Our numeral system is positional, with significant zeros on the right for empty powers of ten, so we cannot truncate the zeros after the significant left part. The precision can then define the right limit, if the precision is known.
About the century question, this is a good use of the "after" parameter: "Give me painters from 1200, over 100 years".
A day or month value of "00" means unknown and implies adapting the precision (if it is not already coarser).

To convert a geographic position to a time more accurate than a day we must conform to the rules of each country.
Few countries use solar time as local time; today these are mainly Pacific Ocean islands near the International Date Line, for tourism reasons. There is no reason to add automatic precision if there is no official solar local time and no explicit mention of an hour or finer value.

Today the use of local solar time for civic affairs is rare to non-existent. But before 1880 everyone did this.

Also, it is common for the date to be recorded, and presumably accurate, since even people without clocks can count days. But the 24 hour span in local time must be converted to Universal Time to avoid a loss of accuracy; otherwise 24 hour uncertainty becomes approximately 48 hour uncertainty.
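The 24-hour versus 48-hour claim can be checked with a rough sketch. It assumes local mean time with offsets between -12 and +12 hours (longitude -180° to 180°, 4 minutes per degree); the function names are hypothetical.

```python
from datetime import datetime, timedelta


def ut_span_known_longitude(local_day_start: datetime, longitude_deg: float):
    """Universal Time interval covered by one local day at a known longitude."""
    offset = timedelta(minutes=4 * longitude_deg)
    start = local_day_start - offset
    return start, start + timedelta(days=1)


def ut_span_unknown_longitude(local_day_start: datetime):
    """Universal Time interval if the same local date could be anywhere on Earth."""
    earliest = local_day_start - timedelta(hours=12)                      # longitude +180
    latest = local_day_start + timedelta(days=1) + timedelta(hours=12)    # longitude -180
    return earliest, latest


start, end = ut_span_known_longitude(datetime(1809, 2, 12), -85.0)
print(end - start)        # 1 day, 0:00:00

start, end = ut_span_unknown_longitude(datetime(1809, 2, 12))
print(end - start)        # 2 days, 0:00:00
```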

But the 24 hour span in local time must be converted to Universal Time to avoid a loss of accuracy; otherwise 24 hour uncertainty becomes approximately 48 hour uncertainty.

Only if you insist on ascribing timezones to dates and comparing dates in different locations with accuracy better than days. Which, given that the source data rarely has such accuracy, may not be a good idea.

This ticket is about whether to write 44BC as -44 or -43. Time zones are completely irrelevant for this.

When referring to "ISO", please note that ISO 8601 actually *changed* from representing 44BC as -44 to now using -43 to represent that year. And then, later, XSD followed that change. That's the root of the confusion. So if you say "ISO", please be clear whether you mean the old or the new interpretation.

daniel set Security to None.
In T99674#1648463, @daniel wrote, in part:

When referring to "ISO", please note that ISO 8601 actually *changed* from representing 44BC as -44 to now using -43 to represent that year. And then, later, XSD followed that change. That's the root of the confusion. So if you say "ISO", please be clear whether you mean the old or the new interpretation.

I would request that contributors not use "ISO" at all, because ISO requires the Gregorian calendar and only the Gregorian calendar. Wikidata now supports both the Julian and Gregorian calendars.

I also suggest that when we reach a conclusion, the documentation of the conclusion explicitly state that it applies regardless of whether the year is a year of the Gregorian calendar or a year of the Julian calendar. This is necessary because an important date conversion book, Calendrical Calculations by Dershowitz and Reingold, and a popular conversion website, Calendar Converter, use the idiosyncratic convention that there is a year 0 when negative years are used with the Gregorian calendar, but there is no year 0 when negative years are used with the Julian calendar. We should explicitly reject the notion that the existence of a year 0 depends on which of these two calendars is used.