The need for a comprehensive Open Data Quality framework

The world is pushing public and private enterprises toward greater openness and transparency. Various initiatives and treaties have been launched to ensure that the needed data is publicly available for scrutiny. However, have they really resolved questions of accessibility and the public's ability to examine the integrity of enterprises? The emphatic answer is that we are not there yet. The release of data has, if anything, multiplied the barriers to using it.

As we walk through the situation, I will share my ordeal with a simple problem: identifying city density across Indian states. Some of the challenges we face when exploring data are:

 

  • Limited release of data: The data released on public portals is often limited by time frame, geography, or entity. Take the example of any programme data available on the Open Data portal: while data for some Indian states has been published, data for many others has not.
  • Accessibility: Visit any government website and you will find data in uncommon file formats such as scanned PDFs, or data littered with unconventional symbols, which makes it difficult for developers to exchange the data or for machines to consume it directly. The Israeli government publishes its data in Hebrew, which makes it hard for data journalists or researchers unfamiliar with the language to collaborate.

“Data! Data! Data!” he cried impatiently.
“I can’t make bricks without clay.”
— Sir Arthur Conan Doyle, The Adventure of the Copper Beeches

  • Outdated data: Take the example of "Department wise expenditure Incurred on Pay and Allowances Civilian Employees by Central Ministries Department" as published on the Open Data portal of India. As of 20 June 2020, the data is available only for the years 2009-10 to 2015-16. In the absence of current data, it is hard to draw conclusions about the current state of affairs.
  • Incomplete Open Data: There are two challenges here:
    • Not every department is forthcoming in releasing its data on the Open Data platform. The data may have been made public, buried somewhere deep in the department's website, but not published on the Open Data portal. This is hardly convenient for the seeker: it requires good familiarity with each department's website, since every department designs its site differently with no common theme. Secondly, the search option on these websites is mostly useless, listing results in seemingly random order with little relevance to the search criteria.
    • Have a look at the Post Office latitude/longitude details published on the Open Data portal. The title of the dataset ("All India Pincode directory with contact details along with Latitude and longitude") promises an exhaustive list of PIN codes with latitude and longitude details. However, 1,54,655 out of 1,54,797 PIN code locations have no latitude and longitude details. If this is the case, the concerned data officer could have changed the title of the dataset, or simply not published a dataset that is not meaningful.
  • Incorrect Open Data: It is already a subject of debate that data from different sources varies, even where the sources claim to have picked the data from the same originator, for the same duration, with all parameters identical. One frequently cited source for latitude and longitude details is GeoNames; I could not easily find such an exhaustive dataset anywhere else. Plotting its latitude and longitude details on a map, however, raises questions about its veracity. One may have a look at the code in our GitHub repository and the resulting plot for four cities. Several districts and areas land in altogether different states, with deviations as high as several hundred miles; some districts even fall outside the country's map. Hover over points on the map for more details. This lets us conclude that due diligence on the data has not been done.
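
Mismatches like these can be caught programmatically before plotting. Below is a minimal sanity-check sketch; the column names and sample rows are hypothetical stand-ins for the real GeoNames extract. It counts records with missing coordinates and flags those falling outside India's approximate bounding box.

```python
import pandas as pd

# Hypothetical sample mimicking the shape of the published dataset;
# the third row deliberately carries coordinates far outside India.
records = pd.DataFrame({
    "pincode":   [110001, 400001, 600001, 700001],
    "latitude":  [28.63,  None,   48.85,  22.57],
    "longitude": [77.22,  None,   2.35,   88.36],
})

# Records with no coordinates at all
missing = records["latitude"].isna() | records["longitude"].isna()

# Rough bounding box for India (mainland plus islands): a heuristic,
# not an authoritative boundary check
in_bounds = (
    records["latitude"].between(6.0, 38.0)
    & records["longitude"].between(68.0, 98.0)
)
suspect = ~missing & ~in_bounds

print(f"{missing.sum()} of {len(records)} records lack coordinates")
print("Suspect rows:")
print(records[suspect])
```

A check this cheap could be run by the publishing portal itself before a dataset goes live.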

Many developers build libraries on top of such data. For example, pgeocode is built on top of the GeoNames database. Developers relying on such unverified data may get incorrect application outcomes.
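As a quick defensive measure, a developer consuming such a library can refuse to trust NaN coordinates instead of passing them downstream. The helper below is a hypothetical sketch; the commented usage shows the actual pgeocode calls (`Nominatim`, `query_postal_code`), which require the package to be installed and a one-time GeoNames download.

```python
import math

def has_coordinates(lat, lon):
    """Return True only when both coordinates are real numbers (not None/NaN)."""
    return not (lat is None or lon is None
                or math.isnan(lat) or math.isnan(lon))

# Usage (requires `pip install pgeocode` and network access on first run):
#   import pgeocode
#   nomi = pgeocode.Nominatim("in")
#   rec = nomi.query_postal_code("110001")
#   if not has_coordinates(rec.latitude, rec.longitude):
#       ...  # fall back or skip, rather than plotting a NaN point
```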

  • Priced data: Some organizations, such as Google, MapQuest, GeoCode, and others, also provide latitude and longitude data, but only in freemium or paid-only models. I am not even debating their right to charge money for such data services; infrastructure investment is required to maintain them. The issue here is rather the lapse of government bodies, which prevents enterprising individuals from getting creative.
  • Data repositories not working or not available: Lastly, I turned to the possibility that ISRO (Indian Space Research Organisation) publishes such data for free use. Unfortunately, the "Data Discovery" link on Bhuvan was not working.

The problem is that there is no common measure of data quality for publishing Open Data. As a result, the recommended guidelines leave out the process of sourcing datasets and ensuring key quality parameters. Some of these data quality parameters are listed below:

    1. Completeness of datasets
    2. Accessibility of datasets
    3. Findability of data
    4. Machine readability of data
    5. Timely publication
    6. Understandability of data
    7. Content structure

To create data that has an impact, the government should also ensure measures of data quality. These may include publishing each dataset with a KPI that indicates its quality. Obviously, having a KPI calls for a measurement framework, which the government also needs to publish.
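As an illustration of what such a KPI could look like, here is a minimal sketch blending three of the parameters above (completeness, machine readability, timeliness) into a single 0-to-1 score. The equal weights and scoring rules are hypothetical illustrations, not a proposed standard.

```python
from datetime import date

def quality_score(total_rows, complete_rows, machine_readable, last_updated,
                  max_age_days=365):
    """Blend completeness, machine readability and timeliness into a 0..1 KPI."""
    completeness = complete_rows / total_rows if total_rows else 0.0
    readability = 1.0 if machine_readable else 0.0
    age_days = (date.today() - last_updated).days
    timeliness = max(0.0, 1.0 - age_days / max_age_days)
    # Equal hypothetical weights for the three sub-scores
    return round((completeness + readability + timeliness) / 3, 2)

# The PIN code dataset above: 142 of 1,54,797 rows complete, published as CSV.
# Even assuming a freshly updated file, near-zero completeness drags the score down.
print(quality_score(154797, 142, True, date.today()))
```

Publishing such a number alongside each dataset would let seekers judge at a glance whether a download is worth their time.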

 
