Improving the Veracity of Open and Real-Time Urban Data

Gavin McArdle (1) and Rob Kitchin (2)

1 National Centre for Geocomputation, Maynooth University, County Kildare, Ireland, gavin.mcardle@nuim.ie
2 National Institute for Regional and Spatial Analysis, Maynooth University, County Kildare, Ireland, rob.kitchin@nuim.ie

The Programmable City Working Paper 13
http://www.nuim.ie/progcity/
13th August, 2015

Prepared for the Data and the City workshop, 31 Aug-1 Sept 2015, Maynooth University

Abstract

Within the context of the smart city, data are an integral part of the digital economy and are used as input for decision making, policy formation, and to inform citizens, city managers and commercial organisations. Reflecting on our experience of developing real-world software applications which rely heavily on urban data, this article critically examines the veracity of such data (their authenticity and the extent to which they accurately (precision) and faithfully (fidelity, reliability) represent what they are meant to) and how they can be assessed in the absence of quality reports from data providers. While data quality needs to be considered at all stages of the data lifecycle and in the development and use of applications, open data are often provided ‘as-is’ with no guarantees about their veracity, continuity or lineage (documentation that establishes provenance and fitness for use). This allows data providers to share data with undocumented errors, absences, and biases. If left unchecked these data quality issues can propagate through multiple systems and lead to poor smart city applications and unreliable ‘evidence-based’ decisions. This creates a danger that open government data portals will come to be seen as untrusted, unverified and uncurated data dumps by users and critics. Drawing on our own experiences we highlight the processes we used to detect and handle errors. This work highlights the necessary janitorial role carried out by data scientists and developers to ensure that data are cleaned, parsed, validated and transformed for use. This important process requires effort, knowledge, skill and time; it is often hidden in the resulting application and is not shared with other data users. In this paper, we propose that, rather than lose this knowledge in the absence of data providers documenting it in metadata and user guides, data portals should provide a crowdsourcing mechanism to generate and record user observations and fixes for improving the quality of urban data and open government portals.

1. Introduction

The availability of open data detailing various aspects of cities continues to grow. This is driven by pressure on local and national governments and public organisations to release their data into the public domain for use and reuse for civic and commercial purposes, to create transparency in city operations, and as a way of benchmarking a city’s performance (Pollock 2006, Janssen 2012, Open Knowledge Foundation 2012). Opening data, it is argued, will foster innovation, provide the raw material for monitoring tools, allow comparison between jurisdictions, inform decision making, and ultimately lead to a sustainable, resilient and democratic city (Bates 2012, Kitchin et al., 2015). For example, McKinsey (2013) estimates that as much as $5 trillion a year could be added to enterprises as a result of open data.
As the race to open datasets advances there are risks that the checks and balances necessary to ensure the veracity of the data, or to inform users of potential quality issues, are not performed. Failing to communicate these risks to data consumers or end users of applications will lead to poor quality derived data, buggy applications and ultimately to poor decisions. Of course, there are challenges facing data providers in detecting various data issues and describing their veracity. For example, measuring data quality usually requires an understanding of the intended purpose, which may not be known by the data producer when sharing the data. It also requires significant overhead in resourcing to produce and share relevant metadata. In the case of real-time data, the velocity and exhaustiveness of the data pose particular challenges. Nonetheless, failing to tackle data veracity issues would be a retrograde position for the open data movement, with open data sites potentially coming to be seen as little more than untrusted, unverified and uncurated data dumps.

This paper describes two smart city applications which rely on a variety of data sources: first, a real-time dashboard, which uses data generated by city authorities and government agencies to provide an interface showing what is happening in Dublin; second, an application which uses data from the Irish Census and city authorities to simulate and model traffic in Dublin City. The paper focuses on examining the quality of the data used in both applications, which include real-time urban data relating to transportation and environment. With no guidance on the veracity of the data, except for limited lineage metadata and the reputation of the data providers, we needed to validate each dataset using a combination of domain knowledge and analysis. This paper discusses the steps we took to validate and repair problematic data and presents our interactions with the data providers when errors were discovered. Typically, the data cleansing stages undertaken by data intermediaries (such as dashboard builders) are ‘black-boxed’ and hidden from end users and the original data producers. This paper discusses the need to inform application users about this process so they can trust the analysis, cleaning, parsing and validation processes and make informed decisions about the data.

Despite this being a known issue, and despite there being examples of veracity metrics and international standards for reporting data quality, open data portals typically do not use them. While our experience shows there is a willingness among data providers to engage with data consumers, the resources are not necessarily available to achieve this in a meaningful and large-scale way. In the absence of this, we discuss the possibility of borrowing techniques from crowdsourced open data as a method to curate and report the quality of urban data so that the steps taken by others, and the errors, problems and uses of the data, are shared in the same spirit as Volunteered Geographic Information (VGI). The information revealed through this process can be used by providers to fix data and utilised by other data consumers when making a judgement on the veracity of urban data.

In the next section, we present several guidelines and standards which relate to the quality of the data used in our applications. In Section 3, two case studies are presented to highlight the typical validation process which data consumers apply.
In Section 4, we examine the possibility of using the wisdom of the crowd and a technical solution to report data quality and usage. Finally, in Section 5, some conclusions and directions for future work are presented.

2. Data Veracity Metrics

There have been several guidelines and measures proposed to provide a common platform for describing data quality (Batini et al., 2009), and the importance of reporting data quality has recently been recognised through several ISO standards, such as ISO 19115-1:2014, which sets minimum and mandatory metadata fields that should accompany spatial data, and ISO 19157:2013, which is a dedicated standard describing components and principles for the quality of spatial data. These standards do not indicate acceptable thresholds for quality data, but rather mandate the metadata that needs to be generated with respect to data veracity in order to receive the standard. Here, we concentrate on some of the most relevant measures for spatial and transport data, the focus of our two case studies, and discuss their application to open data sites.

Shi et al. (2003) review the determination and handling of spatial data quality, building on the work of the International Cartographic Association (ICA), which identified seven key metrics related to spatial data accuracy (Guptill and Morrison, 1995):

• Lineage. The history of the data, including details of the source material and any transformations or processes applied in order to produce the final data.
• Positional Accuracy. An indication of the horizontal and vertical accuracy of the coordinates used in the data, both to absolute and relative locations. It must account for the processes applied to the data which are described by the lineage.
• Attribute Accuracy. The accuracy of the quantitative and qualitative data attached to the spatial data.
• Completeness. The degree to which spatial and attribute data are included or omitted from the datasets. It also describes how the sample is derived from the full population and presents the spatial boundaries of the data.
• Logical Consistency. The dependability of relationships within the spatial data.
• Semantic Accuracy. The quality with which geographical objects are described in accordance with the selected model. Semantic accuracy refers to the pertinence of the meaning of the geographical object rather than its geometry.
• Temporal Data. The date of observation, the type of update and the validity period for the data.

Likewise, the transport science community has defined similar measures for reporting the quality of traffic data. Turner (2002) carried out an intensive study of data veracity measures and concluded that there are six core measures required to describe the accuracy of traffic data:

• Accuracy. How closely the data collected match actual conditions.
• Completeness. The degree to which data values are present in the attributes that require them.
• Validity. The degree to which data values satisfy acceptance requirements within the domain.
• Timeliness. The degree to which data are provided at the time required.
• Coverage. The degree to which data values accurately represent the whole of that which is measured.
• Accessibility. The relative ease with which data can be retrieved and manipulated by data consumers.

Additionally, it was recommended that data quality reports are presented in the metadata alongside the datasets.
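To make this recommendation concrete, the sketch below shows one way such a quality report might be encoded as machine-readable metadata and published alongside a traffic dataset. It is only an illustration: the schema, the dataset identifier and all of the values are assumptions of ours rather than an existing standard or measured figures.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class TrafficDataQualityReport:
    """Hypothetical machine-readable quality report structured around Turner's (2002) six measures."""
    dataset_id: str
    accuracy: str        # how closely collected values match actual conditions
    completeness: float  # share of required attribute values that are present (0-1)
    validity: float      # share of values passing domain acceptance checks (0-1)
    timeliness: str      # how promptly data are published relative to observation
    coverage: str        # portion of the phenomenon (network, area, period) represented
    accessibility: str   # how the data can be retrieved (formats, API, licence)

report = TrafficDataQualityReport(
    dataset_id="dcc-trips-travel-times",  # hypothetical identifier
    accuracy="travel times estimated from on-street detectors; not independently ground-truthed",
    completeness=0.97,   # illustrative placeholder, not a measured figure
    validity=0.95,       # illustrative placeholder, not a measured figure
    timeliness="published every minute, roughly one minute behind observation",
    coverage="major arterial routes into and out of Dublin City only",
    accessibility="CSV download from an open data portal; no query API",
)

# Serialise the report so it can be published alongside the dataset as metadata.
print(json.dumps(asdict(report), indent=2))
```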
Including metadata about the quality and veracity of data allows data consumers to assign an internalised confidence score to the various aspects of the data. This will influence how the data are used and how the results are interpreted. Moreover, the Environmental Protection Agency in the United States has developed a set of four questions (EPA, 2006), the answers to which should be published alongside environmental data in order to allow data consumers to assess its quality and determine if it is fit for their specific purpose:

1. Can a decision (or estimate) be made with the desired level of certainty, given the quality of the data?
2. How well did the sampling design perform?
3. If the same sampling design strategy is used again for a similar study, would the data be expected to support the same intended use with the desired level of certainty?
4. Is it likely that sufficient samples were taken to enable the reviewer to see an effect if it was really present?

Combined, the four questions allow data consumers to make informed decisions about using the data for their requirements and also provide a guide as to how to interpret the results correctly and what weight to place on the results in a decision-making process.

In contrast, it has been argued by some that big data initiatives utilising real-time data do not need the same standards of data quality, veracity and lineage because the exhaustive nature of the dataset removes sampling biases and more than compensates for any errors, gaps or inconsistencies in the data or weaknesses in fidelity (Mayer-Schonberger and Cukier, 2013). The argument for such a view is that “with less error from sampling we can accept more measurement error” (p. 13) and “tolerate inexactitude” (p. 16). Nonetheless, the warning “garbage in, garbage out” still holds, and issues of accuracy, completeness, validity, timeliness, coverage and accessibility remain important. For example, real-time data can be biased due to the demographic being sampled (e.g., not everybody uses social media platforms) or the data might be gamed or faked through false accounts or hacking (e.g., there are hundreds of thousands of fake Twitter accounts seeking to influence trending and direct click stream trails) (Bollier, 2010; Crampton et al., 2012). Moreover, the technologies being used and their working parameters can affect the nature of the data. For example, the quality of a pollution or sound sensor can affect the ‘noisiness’ of the data generated (Choi et al., 2009), and which posts on social media are most read or shared is strongly affected by ranking algorithms, not simply interest (Baym, 2013). Similarly, APIs structure what data are extracted, for example in Twitter only capturing specific hashtags associated with an event rather than all relevant tweets (Bruns, 2013), with González-Bailón et al. (2012) finding that different methods of accessing Twitter data (search APIs versus streaming APIs) produced quite different sets of results. As a consequence, there is no guarantee that two teams of researchers attempting to gather the same data at the same time will end up with identical datasets (Bruns, 2013). There are now a plethora of smart city data standards being developed aimed at improving and aligning the data being generated (see ANSSC 2015 for an overview).
While these general metrics and associated metadata are applicable to all data, including those held within open data portals, at present the metrics applied to open data are generally more concerned with measuring the nature of the data included, or the value of the data portal, than the quality or veracity of the data contained within. For example, Berners-Lee (2009) presents a star rating for open data and awards the highest grade to machine-readable and linked open data, while data in unformatted PDF files, which are still open data but are not as usable as machine-structured formats such as CSV and JSON, receive a lower grade. Martín et al. (2015) focus on studying the usability, functionality and data formats of 36 Open Government Data portals. While accuracy of the portals is considered, it is merely a check as to whether the data description matches the data. Similarly, Umbrich et al. (2015) apply the core metrics for assessing data quality described by Batini et al. (2009) to the metadata provided in data portals, but do not consider the veracity of the data to which the portal provides access. The Open Data Institute has developed a certificate which data producers can use to add credibility to their data. The certification is self-assigned and is obtained by the provider by answering a series of questions about their data; a description of the quality control process needs to be presented alongside the data in order to become accredited (ODI, 2015). Similarly, the EU INSPIRE Directive requires spatial data quality and lineage to be reported alongside the data (Inspire, 2015).

Despite these guidelines, recommendations, certificates and standards for reporting data quality, open data portals typically do not report enough metadata to enable consumers to make a reliable judgement regarding the quality of the data. A review of the open data portals for the urban areas of London (http://data.london.gov.uk/), Paris (http://opendata.paris.fr/) and Dublin (http://www.dublinked.ie), and of the World Council on City Data (which reports data for 253 cities in 80 countries; http://open.dataforcities.org), reveals that neither general nor specific measures of data quality are reported. While data lineage, such as the age of the data (timeliness) and the name of the data provider, is generally given, the transformation process from raw to finished product is not described. Similarly, the spatial and temporal extent is given, but accuracy and precision measurements are not provided. Although our case studies show how fundamental errors can be detected, there are potentially technical, political and financial pressures preventing data providers from delivering this information to consumers. Given the potentially infinite uses of different classes of urban data, it is also difficult for data producers to give reliable veracity and quality scores for each domain. Nonetheless, there is a need for much better analysis and sharing of data quality. In Section 4, we discuss the possibility of using a crowdsourced approach to rate the quality of data in different domains. The approach would use the create, discuss and edit paradigm used for collecting and curating open data on platforms such as Wikipedia and OpenStreetMap (OSM).

3. Case Studies

In this section, we discuss the process which we used to validate and clean urban data for two projects.
The description presented is representative of our experience of working with a variety of urban data during the development of the Dublin Dashboard (Kitchin et al. 2015) and the Human Mobility projects (McArdle et al. 2014; McArdle et al. 2012).

Visualising Real-Time Traffic Data on the Dublin Dashboard

The Dublin Dashboard provides citizens, government workers and companies with real-time information, urban indicator and benchmarking data, and other forms of data about all aspects of the city through a series of interactive graphs, maps and applications. It aims to enable users to gain detailed, up-to-date intelligence about the city that will help foster smart decision making and smart citizens. The data are sourced from a variety of data providers including the Central Statistics Office, the Department of Environment, Community and Local Government, the Environmental Protection Agency, Dublinked, and the four Dublin local authorities.

One goal of the Dublin Dashboard is to answer questions related to what is happening in the city right now. To achieve this, the dashboard collates, analyses and presents real-time data. This is materialised via real-time maps which show the locations where data are recorded and indicate the current value of the variable being measured. The real-time data streams relate to transport (e.g., how many bikes/spaces are in bike stands, road speeds on different segments, the number of spaces in car parks, general CCTV footage) and environment (e.g., air traffic, air quality, pollution readings, water levels, sound levels, current weather). The data are collected from a variety of sources. In some cases the data are provided via an Application Programming Interface (API) which allows a developer to query the data and obtain results in a machine-readable format, generally JSON; the results can then be digested, presented and used in applications. Other data providers use file formats such as CSV (Comma-Separated Values) or XML (Extensible Markup Language). In these cases, the data consumer or developer needs to parse and process the files in order to select those data which are relevant to their application domain.

Figure 1: The Real-Time Travel Map on the Dublin Dashboard.

The real-time travel map, shown in Figure 1, is one of the most frequently viewed tools in the Dublin Dashboard. The map shows the predicted travel time, by car, on all major artery routes into and out of Dublin City. The data are provided by Dublin City Council (DCC) and are obtained using TRIPS (Travel-time Reporting and Integrated Performance System), which predicts travel time based on data generated by on-street traffic detection technologies (e.g. transduction loops). The data are published by DCC every minute via a CSV file which is available to download from the Dublinked website. Dublinked is a data portal used by the four Dublin local authorities to share data with the public and organisations in order to promote entrepreneurship and innovation. The TRIPS data were first made available in 2012 and have since been viewed over 2,000 times (Dublinked, 2015). DCC reserves the right to cancel access or permission for data use and will not be held liable for any losses arising from their use, or from the use of other information based on these data. There is no indication of the data veracity supplied with the dataset. The CSV travel-time data needs to be used in conjunction with other static files which describe the road network. These files are also available to download from Dublinked.
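To illustrate the kind of janitorial work this involves, the sketch below shows how a developer might load the travel-time CSV, join it to the static network description, and screen the result for implausible implied speeds (the kind of check that, as described below, revealed errors in the dataset). The file names and column names are assumptions for illustration only; the actual TRIPS schema differs, and segment lengths may need to be derived from the network geometry.

```python
import pandas as pd

# File and column names are illustrative assumptions, not the actual TRIPS schema.
nodes = pd.read_csv("trips_nodes.csv")         # node_id, name
links = pd.read_csv("trips_links.csv")         # link_id, from_node, to_node, length_m
times = pd.read_csv("trips_travel_times.csv")  # link_id, travel_time_s

# Join the per-minute travel times to the static network description so that
# each record names the intersections it connects.
df = (times.merge(links, on="link_id")
           .merge(nodes.rename(columns={"node_id": "from_node", "name": "from_name"}),
                  on="from_node")
           .merge(nodes.rename(columns={"node_id": "to_node", "name": "to_name"}),
                  on="to_node"))

# Simple plausibility screen: flag segments whose implied speed is impossible
# for an urban road (e.g. roughly 6 km reported as a 36-second journey).
df["speed_kmh"] = (df["length_m"] / 1000.0) / (df["travel_time_s"] / 3600.0)
suspect = df[df["speed_kmh"] > 120]
print(suspect[["from_name", "to_name", "length_m", "travel_time_s", "speed_kmh"]])
```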
The network consists of a list of nodes (named road intersections) and links (roads connecting intersections). A unique id for each node and link is used to match the travel-time data. This allows the travel time for each road segment to be reported and used in other applications and software. After downloading the relevant CSV files, a developer can produce a matrix representation of travel time. Table 1 shows an example of travel times extracted from the dataset.

From                                  To                                                Travel time (secs)
SWORDS RD SHANOWEN RD                 INCHICORE ROAD OLD KILMAINHAM LANE                36
DORSET ST GARDINER ST                 BLESSINGTON ST DORSET ST                          42
CONSTITUTION HILL WESTERN WAY FLATS   PHIBSBORO ROAD NORTH CIRCULAR ROAD DOYLES CORNER  73

Table 1: Sample of journey times from the DCC TRIPS dataset.

It is relatively easy to process the data provided to get to this stage of development. The data can then feed into other applications such as route planning, journey time or traffic analysis software, or a travel map like that in the Dublin Dashboard. Without domain knowledge of the geography of Dublin, or mapping the data to add context, the errors in the dataset are not visible. However, creating a map of the segments and times reveals a number of issues. For example, Figure 2 maps the data in Table 1, revealing an impossible journey time of 36 seconds to travel over 6 kilometres due to the inclusion of a false road segment. The dataset contains several examples of impossible journey times like this.

Figure 2: The sections of road for which travel times are given in the TRIPS dataset. The highlighted line is a false road segment which is over 6 km in length and has a reported travel time of 36 seconds.

The lineage metadata shows the data originate from a reliable source (DCC) but does not contain processing information, and despite the ease with which we were able to detect the errors without using any specialist tools, the data provider does not report them. As developers, our solution was to remove the road segments which contained impossible travel times from the dataset and to make no claims regarding the accuracy of the data displayed in the dashboard. While the solution was adequate for our application, we do not report the errors to users of the dashboard, nor do we report the techniques we used to identify and fix them. In part, this is because once fixed they are no longer a problem, but also because we have no way of verifying the data beyond spotting obvious flaws without working directly with the data provider or deploying some form of ground truthing, for which we have no resources.

Building a Traffic Simulation for Dublin

The second urban data project considered builds an agent-based traffic simulation for Dublin City (McArdle et al., 2014). The model simulates the travel patterns of private vehicles in the Greater Dublin Region and attempts to minimise the travel time for individual vehicles by rerouting commuters on the road network and adjusting departure times through many iterations of the simulation. The simulation completes when equilibrium is achieved and further alterations to routes and travel times will not improve the overall system-wide travel time. The project uses a variety of urban data sourced from POWSCAR (Place of Work and School - Census of Anonymised Records) and SCATS (Sydney Coordinated Adaptive Traffic System) as input to the simulation and as a means of validating the simulation results and output. To create the simulation, a tool called MATSim was used (Rieser, 2010).
In order to determine where vehicles travel, the tool requires a demand to be placed on the road network. For this project the initial demand consists of the home and work locations of individuals organised into an origin-destination matrix augmented with the mode of transport and departure times. The simulation only considers individuals who live or work in Dublin and commute by driving a private car. The demand data were obtained from the POWSCAR subset of the Irish National Census, which is conducted every five years and was most recently conducted in 2011. POWSCAR provides the home, work, school, and college location of individuals; the mode of transport used to commute; and the time at which individuals leave their home in the morning, along with other variables such as age, socioeconomic grouping, household size and travel time to work, school or college. The home location is anonymised by describing it at Small Area level, a geographic area consisting of 80 to 100 households. The work location is presented at a 250-metre grid level. The time of departure is represented by discrete 30-minute intervals for the morning period, several transit modes are encoded in the means of transport, and travel time is described in minutes. When only drivers who commute to or from County Dublin were considered, a dataset of approximately 300,000 individuals was obtained.

Prior to running the simulation, data verification was carried out to test the validity of the POWSCAR input data for the traffic modelling. Direct ground truthing was not an option, so data analysis was carried out. The analysis focused on the self-declared journey time parameter, which is an estimate made by individuals regarding their commuting time. The distance between the centroids of the home and work areas was calculated and used with the journey time to estimate an average speed for each commuting trip. The speed of each commuter is shown in Figure 3. The graph shows that many commuters achieved impossible average speeds. The lower red line shows the mean speed of 35 km/hr, while the upper blue line shows a speed of 120 km/hr, which is the legal speed limit in Ireland but is not achievable in Dublin during the commuting hours. Further analysis reveals that over 10,000 commuters had an average commuting speed of over 120 km/hr (3% of the dataset), 7,000 commuters had an estimated average speed of over 200 km/hr, and over 1,000 commuters had a speed greater than 1,000 km/hr. This analysis shows inconsistencies in the dataset which were not reported in the documentation. A filter was developed to remove individuals with an impossible travel time from the input matrix before using the data in the traffic simulation process (a sketch of such a filter is given below).

Figure 3: The speed distribution of commuters in the POWSCAR dataset.

The output of the simulation is an hourly count of the number of vehicles using each road segment in Dublin. This enables a 24-hour profile to be produced across the city. In order to validate the effectiveness of the simulation techniques, these data are usually benchmarked against ground truth for the same road segments. The ground truth can be obtained using a manual observation count or by using count data from movable or embedded sensors in the road surface. Given the cost and resources required to conduct a manual count, and the limited coverage of the city that such a count can achieve, we opted to use data obtained from SCATS (Sydney Coordinated Adaptive Traffic System) for Dublin.
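By way of illustration, the following is a minimal sketch of the speed plausibility filter described above. The column names (home and work coordinates, self-declared journey time in minutes) are assumptions of ours for the example; the actual POWSCAR fields and the code used in the project differ.

```python
import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Column names are assumptions: centroid of the home small area, centre of the
# workplace grid cell, and the self-declared journey time in minutes.
commuters = pd.read_csv("powscar_car_commuters.csv")

commuters["distance_km"] = commuters.apply(
    lambda row: haversine_km(row["home_lat"], row["home_lon"],
                             row["work_lat"], row["work_lon"]), axis=1)
commuters["speed_kmh"] = commuters["distance_km"] / (commuters["journey_time_min"] / 60.0)

# Records implying speeds above the 120 km/hr national limit are treated as
# invalid and removed from the demand matrix before the simulation is run.
plausible = commuters[commuters["speed_kmh"] <= 120]
```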
SCATS is a technology which is used to optimise traffic flow by counting cars passing through an intersection and using these data to control the traffic light sequence. While the data are collected in near real-time, a sample of the data in CSV files is made available by DCC via Dublinked (Dublinked, 2015a). The data were first provided in January 2012, updated in April 2012, and made available under a PSI licence. The data are aggregated over 5-minute intervals for each approach to an intersection for each day, and data are provided for the period 1 January 2012 to 30 April 2012 (8-12 months after the POWSCAR data were generated). Each sensor is described in the CSV file by a street segment id, an arm number and an arm angle, which describe the approach road (latd, longd), together with the centroid coordinates of the intersection. An example of this is shown in Table 2.

streetSegId  armNumber  armAngle  lato      longo     latd      longd
681          1          0         53.33981  -6.24184  53.3398   -6.24175
160          1          0         53.34437  -6.26286  53.34435  -6.26276
1396         1          0         53.34513  -6.23838  53.34512  -6.23828
862          1          0         53.34564  -6.24899  53.34563  -6.24889

Table 2: An example of the CSV file describing the road intersections.

The challenge is to map the SCATS sensors to the road network used in the simulation process. Initial spatial queries showed there was no direct technique to reliably map the sensor coordinates to a road segment. Geovisual analysis, as shown in Figure 4, highlights the problem: the coordinates indicating the location of the SCATS sensors are mildly inaccurate. Attempts to use translations, transforms and offsets to align the sensors with the road network failed to improve their alignment. We entered into dialogue with DCC representatives to remedy the situation and provided a visualisation of the problem, as shown in Figure 4. The engineers in DCC were very responsive and conducted their own analysis on the raw data which describes their sensor network, but no satisfactory solution was found. Finally, we were informed that the location data were only indicative of the locations of the sensors. It was therefore impossible to automatically match road segments to SCATS sensors with any accuracy. Due to the number of sensors, it was not feasible to manually recode the coordinates of each sensor, and so this method of validation was abandoned. Instead, data produced by the National Roads Authority in Ireland, which show the hourly traffic volume between intersections on motorways around Dublin, were used to validate the output of the traffic simulation.

Figure 4: The locations of the SCATS sensors do not correspond to the road intersections.

As with the first case study, we spent time and effort examining the data and applying different techniques to validate and repair the data and to determine if it matched our requirements. This effort and its results are not recorded or reported alongside the data or within the applications. Furthermore, despite informing the data providers of the error in the SCATS data, no action was taken by the provider to issue a data quality statement. This means that future data consumers will need to replicate the work that we carried out before finding the errors and will need to determine if the data are fit for their specific purpose.

4. Discussion

The problem of reporting data quality is recognised and well understood. As Section 2 highlights, there are a variety of standards available for reporting data quality and other relevant metadata (e.g. calibration).
While some measures are domain specific, several core quality metrics have emerged. From our experience, data producers are not using these measures, or are not reporting them, and instead provide data with no commentary about their veracity and only scant details of their lineage. This is either through ignorance, lack of resources, indifference, or a lack of expertise. Unfortunately, producing urban data which contain undocumented errors is a retrograde step for the open data movement. As such veracity issues come to light they have the potential to fuel accusations that open government data portals are untrusted, unverified and uncurated data dumps. While the value of open data for the economy and for business has been well documented (McKinsey, 2013), the cost to business of using poor quality data is also recognised. While some might argue that, in the case of big data, more trumps better, the reality is that poor veracity reduces the validity of analysis and interpretation.

The preferred solution to the lack of documented data veracity is for data producers to more diligently and extensively document such issues in their metadata, along with user guides as to how best to address or compensate for them when using the data for different purposes. However, if the status quo remains, the onus falls on data consumers and developers to determine whether they are satisfied that the urban data they are using are reliable and fit for their intended purpose. In our case we were developing two urban applications which used a mix of open administrative census data and real-time travel data which did not have quality measurements in the form of metadata. We therefore applied domain knowledge and various analysis techniques to validate the data. Three different datasets were considered and in each case errors in the data were discovered. In Section 3, we documented the process which was used to test the data for our requirements and also described the steps used to clean and repair the data. In one case, we interacted with the data producer but the problem was not resolved nor documented as metadata by the data producer.

The analysis and validation which we carried out required a certain level of expertise, effort and time. Although this effort pays off in the form of a working application or improved data quality for the traffic simulation tool, the process, our findings and our fixes are not recorded or reported, which means that this type of effort will need to be replicated by each new consumer of the data. To reduce this, we propose a mechanism for crowdsourcing metadata about the quality of datasets, similar to the collection of Volunteered Geographic Information (Goodchild, 2007). Using the wisdom of the data-user crowd could create a more curated form of urban data, encourage greater engagement between data providers and consumers, and enhance the reputation of open data portals. The proposed approach mimics the ethos of Wikipedia and OSM, in which users of these websites can contribute and edit content. However, rather than directly editing and contributing datasets, we propose that users can contribute and edit metadata to describe the veracity of a dataset and provide feedback about any processing that was applied to validate the data. This could be done using many of the recognised domain-specific standards, like those outlined in Section 2.
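As an illustration of what such a contribution might look like, the sketch below shows a possible user-contributed veracity annotation for the SCATS dataset discussed in Section 3. The field names, identifiers and values are assumptions of ours, not a schema used by any existing portal.

```python
# A possible crowdsourced veracity annotation that a portal could attach to a
# dataset record. All field names and values are illustrative assumptions.
annotation = {
    "dataset_id": "dcc-scats-volumes",            # hypothetical portal identifier
    "contributor": "user:1234",
    "date": "2015-08-13",
    "quality_dimension": "positional accuracy",   # e.g. drawn from the ICA metrics or ISO 19157
    "observation": ("Sensor coordinates are only indicative and do not align with "
                    "road intersections, so automatic matching to a road network fails."),
    "workaround": ("Validated simulation output against National Roads Authority "
                   "motorway counts instead of SCATS counts."),
    "affected_fields": ["lato", "longo", "latd", "longd"],
    "provider_response": "acknowledged; no revised dataset issued",
}
```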
The open data portal should provide the tools to facilitate and support this crowdsourcing of data veracity, along with a forum to discuss the data and give examples of where they have been used. Some urban data portals, such as the Paris data portal, do provide a means for discussing datasets, but there is scope to extend this functionality into a reporting and sharing interface. The approach has been successful for OSM and Wikipedia, where the editing of data is self-policed by members who form a community so that false or misleading information becomes rare. The approach is akin to the idea of civic hacking, in which citizens want to improve services for all (Coleman and Golub, 2008; Perng and Kitchin 2015). While there are arguments for not sharing data veracity and processing experiences, such as retaining a competitive advantage, this has not been the case in the open data community. For example, the ODI has members who volunteer time to process open data to improve its usability by translating it into machine-readable formats. Generally within the crowdsourcing community, individuals do not receive monetary reward for their efforts but receive recognition that their contribution is helping others, while also increasing their profile as experts. This proposed approach echoes the more general move towards using crowdsourced data, collected both actively (volunteered) and passively, as a way of creating new official data and official statistics and improving existing government data (Goodchild, 2007; Lauriault and Mooney, 2014).

Like Wikipedia, OSM and other crowdsourced geographic data, which are inherently unfinished products (Dodge and Kitchin, 2013), determining the veracity of a dataset will be an ongoing task, as there are always novel and innovative uses of data for which new quality and veracity metrics will be required. Following our analysis, and using the proposed approach, we would contribute the knowledge we gained about the inaccurate positioning of the SCATS sensors (using the ICA data quality methodology), the inaccurate travel times in the TRIPS data (using the transport science metrics), and the invalid speeds seen in the POWSCAR dataset. This would benefit future users of these datasets and may lead to a revised dataset being made available by the data producers. It would also allow other data users to update their applications based on this new information.

5. Conclusion

Our experience in the case studies documented here and in other data-intensive projects (Gleeson et al., 2008; Kitchin et al., 2013; McArdle et al., 2014b; Calabrese et al., 2015) highlights several challenges related to the use of urban data regarding its validity, veracity, and reliability. Our experience is typical and shows how errors are handled, or not, by both data producers and consumers. While there are metrics, methodologies and guidelines, and increasingly standards and certificates, for measuring the quality and accuracy of data, our experience shows that these are not being widely used in urban open data portals. Data producers seem happy to provide data ‘as is’, without any guarantee regarding their quality or accuracy, either due to laxity in their efforts or to avoid liability for inaccurate data. Doing this on a wide scale in open data portals is potentially dangerous and may lead to urban data portals being regarded as unreliable by data consumers and critics.
Moreover, it potentially jeopardises the intended economic and civic engagement benefits which are often the goals of such portals. While our examples are not exhaustive in terms of the types of veracity and quality issues affecting data, they illustrate how issues with accuracy and consistency were detected through analysis, and they highlight the need for developers to do such checking when no veracity or lineage metadata accompanies urban data. Why data producers do not carry out such analysis or provide the full lineage of the data is an open question and further study is required. Issues related to resources, expertise, skills, time and the risk of liability are likely to be cited as causes. Further, data are often provided without knowledge of all the possible end uses, and so it is difficult to express data veracity across an exhaustive range of domains.

In the absence of data providers carrying out quality analysis and providing detailed metadata and lineage information, developers must assess data quality and accuracy for their specific needs. Our experience shows that the effort required to carry out this process can be great. This effort is typically lost and becomes black-boxed or encoded in the resulting application or tool. The problem is likely to increase in the era of big data, with many providers, such as local government departments, being unable to maintain veracity metadata for rapidly changing data. To resolve this issue, we propose that a data veracity community be developed around the use of open government data, including real-time data. This community can then curate the data by providing metadata about veracity and the processing carried out in developing applications, and can discuss the data with other consumers and the producer. This crowdsourcing approach would build on the spirit of sharing seen in the open data community and mimic that seen in Wikipedia and OSM. This should lead to greater trust in urban open data portals and result in improved smart city applications and smarter evidence-based decisions. The next step is to design and integrate this proposed approach with an open data portal and we are exploring the possible implementation of such an endeavour with stakeholders.

Acknowledgements

The research for this article was funded by a European Research Council Advanced Investigator award (ERC-2012-AdG-323636-SOFTCITY) and Science Foundation Ireland.

References

ANSSC (2015) Directory of Smart and Sustainable Cities Standardization Initiatives and Related Activities, American National Standards Institute Network on Smart and Sustainable Cities. http://publicaa.ansi.org/sites/apdl/ANSI%20Network%20on%20Smart%20and%20Sustainable%20Cities/ANSSC-Directory-of-Initiatives.pdf (last accessed 13 July 2015).

Bates, J. (2012) "This is what modern deregulation looks like": Co-optation and contestation in the shaping of the UK's Open Government Data Initiative. The Journal of Community Informatics 8(2). http://www.ci-journal.net/index.php/ciej/article/view/845/916 (last accessed 6 February 2013).

Baym, N.K. (2013) Data not seen: The uses and shortcomings of social media metrics. First Monday 18(10). http://firstmonday.org/ojs/index.php/fm/article/view/4873/3752 (last accessed 3 January 2014).

Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 16.

Berners-Lee, T. (2006) Linked Data: Design Issues. http://www.w3.org/DesignIssues/LinkedData.html (accessed July 2015).

Bollier, D. (2010) The Promise and Peril of Big Data. The Aspen Institute. http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_Promise_and_Peril_of_Big_Data.pdf (last accessed 1st October 2012).

Bruns, A. (2013) Faster than the speed of print: Reconciling ‘big data’ social media analysis and academic scholarship. First Monday 18(10). http://firstmonday.org/ojs/index.php/fm/article/view/4879/3756 (last accessed 3rd January 2014).

Calabrese, F., Di Lorenzo, G., McArdle, G., Pinelli, F. and Van Lierde, E. (2015). Real-Time Social Event Analytics. Proceedings of NetMob 2015, MIT Media Lab, Boston, USA.

Choi, S., Kim, N., Cha, H. and Ha, R. (2009). Micro sensor node for air pollutant monitoring: Hardware and software issues. Sensors, 9(10): 7970-7987.

Coleman, G. and Golub, A. (2008). Hacker Practice: Moral Genres and the Cultural Articulation of Liberalism. Anthropological Theory, 8(3): 255–277.

Crampton, J., Graham, M., Poorthuis, A., Shelton, T., Stephens, M., Wilson, M.W. and Zook, M. (2012) Beyond the Geotag? Deconstructing “Big Data” and Leveraging the Potential of the Geoweb. http://www.uky.edu/~tmute2/geography_methods/readingPDFs/2012Beyond-the-Geotag-2012.10.01.pdf (last accessed 21st February 2013).

EPA (2006). Data Quality Assessment: A Reviewer's Guide, EPA QA/G-9R, EPA/240/B06/002. http://www.epa.gov/quality/qs-docs/g9r-final.pdf (last accessed 13th Aug 2015).

Dodge, M. and Kitchin, R. (2013) Crowdsourced Cartography: Mapping Experience and Knowledge. Environment and Planning A, 45(1): 19–36.

Dublinked (2015) TRIPS Dataset. http://dublinked.ie/datastore/datasets/dataset-215.php (accessed July 2015).

Dublinked (2015a) SCATS Dataset. http://dublinked.ie/datastore/datasets/dataset-305.php (accessed July 2015).

Gleeson, J., Kitchin, R., Bartley, B., Driscoll, J., Foley, R., Fotheringham, S. and Lloyd, C. (2008) The Atlas of the Island of Ireland. AIRO/ICLRD, Naas.

González-Bailón, S., Wang, N., Rivero, A., Borge-Holtoefer, J. and Moreno, Y. (2012) Assessing the Bias in Communication Networks Sampled from Twitter. Working Paper. http://arxiv.org/abs/1212.1684 (last accessed 17th January 2014).

Goodchild, M. F. (2007). Citizens as sensors: the world of volunteered geography. GeoJournal 69(4): 211-221.

Guptill, S. C. and Morrison, J. L. (Eds.) (1995). Elements of Spatial Data Quality. Elsevier.

Inspire (2015). EU INSPIRE Directive for Spatial Data. http://inspire.ec.europa.eu/ (accessed July 2015).

Janssen, K. (2012) Open Government Data: right to information 2.0 or its rollback version? ICRI Working Paper 8/2012. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2152566 (last accessed 14th August 2013).

Kitchin, R., Gleeson, J. and Dodge, M. (2013) Unfolding Mapping Practices: A New Epistemology for Cartography. Transactions of the Institute of British Geographers 38(3): 480–496.

Kitchin, R., Lauriault, T. P. and McArdle, G. (2015). Knowing and governing cities through urban indicators, city benchmarking and real-time dashboards. Regional Studies, Regional Science 2(1): 6-28.

Lauriault, T. P. and Mooney, P. (2014). Crowdsourcing: A Geographic Approach to Public Engagement. Programmable City Working Paper 6. Available at SSRN 2518233.

Martín, A. S., De Rosario, A. H. and Pérez, M. D. C. C. (2015). An International Analysis of the Quality of Open Government Data Portals. Social Science Computer Review.

Mayer-Schonberger, V. and Cukier, K. (2013) Big Data: A Revolution that will Change How We Live, Work and Think. John Murray, London.

McArdle, G., Demšar, U., van der Spek, S. and McLoone, S. (2014). Classifying pedestrian movement behaviour from GPS trajectories using visualization and clustering. Annals of GIS 20(2): 85-98.

McArdle, G., Furey, E., Lawlor, A. and Pozdnoukhov, A. (2014). Using digital footprints for a city-scale traffic simulation. ACM Transactions on Intelligent Systems and Technology (TIST) 5(3): 41.

McArdle, G., Lawlor, A., Furey, E. and Pozdnoukhov, A. (2012). City-scale traffic simulation from digital footprints. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing (pp. 47-54). ACM.

McKinsey (2013) Open data: Unlocking innovation and performance with liquid information. http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information (accessed July 2015).

ODI (2015). Open Data Certificate. https://certificates.theodi.org/ (accessed July 2015).

Open Knowledge Foundation (2012). Open Data Handbook Documentation, 1.0.0. http://opendatahandbook.org/ (last accessed 16th August 2013).

Perng, S. Y. and Kitchin, R. (2015). Solutions, Strategies and Frictions in Civic Hacking. Programmable City Working Paper 10. Available at SSRN 2606939.

Pollock, R. (2006). The value of the public domain. IPPR. http://www.ippr.org/publication/55/1526/the-value-of-the-public-domain (accessed July 2015).

Rieser, M. (2010) Adding Transit to an Agent-Based Transportation Simulation: Concepts and Implementation. PhD thesis, VSP, TU Berlin, Germany.

Turner, S. (2002). Defining and measuring traffic data quality. In Proceedings of the Traffic Data Quality Workshop, Washington, DC.

Shi, W., Fisher, P. and Goodchild, M. F. (2003). Spatial Data Quality. CRC Press.

Umbrich, J., Neumaier, S. and Polleres, A. (2015). Towards assessing the quality evolution of Open Data portals. In ODQ2015: Open Data Quality: from Theory to Practice Workshop, Munich, Germany (Mar 2015). http://polleres.net/publications/umbretal-2015ODQ.pdf.