Special Session 32


Title of session: Addressing quality in the context of big data - the ESSnet contribution

Chair: Albrecht Wirthmann

Room: S4B Lajkonik

Time: 08:30 - 10:00

Date: 29 June

Session 32 - papers & presentations

Presenting Author / Abstract
Giulio Barcaroli
e-mail: barcarol@istat.it
Title: <<< Quality evaluation of experimental statistics produced by making use of Big Data >>>
In 2017 the Italian Institute of Statistics (Istat) started producing a set of experimental statistics based on Internet data, one of the most relevant Big Data sources. These statistics refer to the activities that enterprises carry out on their websites (web ordering, job vacancies, links to social media, etc.) and are a strict subset of those currently produced by the “Survey on ICT in enterprises”. The idea is to calculate these estimates from website content, which is collected with web scraping tools and processed with text mining techniques. Models are then fitted on the subset of enterprises for which both survey-reported values and relevant terms obtained by the web scraping/text mining procedures are available. Experimental statistics have been obtained with two different estimators: the first is a fully model-based estimator; the second combines model-based estimates and survey estimates. Across the various domains for which they have been calculated, the three sets of estimates (survey, model and combined) are in most cases close (i.e. the model and combined estimates lie within the confidence intervals of the survey estimates), but in some cases they are not.
The question is how to evaluate the accuracy of the three sets in order to understand whether experimental statistics can replace survey ones.
Considering the different factors that can produce bias in survey estimates (total non-response and response errors) and in alternative estimates (population under-coverage and prediction errors), these factors are analysed in detail with respect to the real conditions in the 2017 experience. Finally, a simulation study is carried out in order to investigate the conditions under which a given estimator performs better than the others.
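The closeness criterion mentioned in this abstract (an alternative estimate lying within the confidence interval of the survey estimate) can be illustrated with a minimal sketch. It assumes a simple random sample and a normal-approximation interval for a proportion; the abstract does not state the survey design or the interval actually used, and the function names and figures below are purely illustrative.

```python
import math

def survey_ci(successes, n, z=1.96):
    """Normal-approximation CI for a proportion (assumes simple random sampling)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def within_ci(estimate, ci):
    """True if an alternative (model or combined) estimate falls inside the survey CI."""
    low, high = ci
    return low <= estimate <= high

# Illustrative numbers only: 150 of 1000 surveyed enterprises report web ordering.
ci = survey_ci(successes=150, n=1000)
print(within_ci(0.16, ci))  # a hypothetical model estimate close to the survey value
print(within_ci(0.20, ci))  # a hypothetical model estimate far from the survey value
```

A check of this kind only says whether two estimates are compatible; it does not say which one is less biased, which is the question the simulation study in the abstract addresses.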
Christina Pierrakou
e-mail: c.pierrakou@statistics.gr
Title: <<< AIS: Defining a ship’s journey and sea traffic analysis >>>
Ships broadcast information on their location and status on a frequent basis by means of a radio signal. This so-called Automatic Identification System (AIS) provides a big data source for maritime and emission statistics. Research on the use of AIS for official statistics is part of the ESSnet Big Data project, performed by the Netherlands (work package leader), Denmark, Greece, Norway and Poland. Here, we present methods to define a ship’s journey and to perform sea traffic analyses. Defining the journey of a ship from AIS data is needed to obtain insight into all the ports the ship visited. It also makes it possible to calculate the distance a ship travels. Furthermore, a ship’s journey is necessary to improve the calculation of emissions. To determine the journey of a ship, we build further on the algorithm we had already developed to determine a port visit. The resulting journey algorithm is output-driven and enables us to define the start of a journey and to deal with noise in the signal. For traffic and economic analyses, we also wanted to explore the possibility of using AIS data to calculate the number of ships at certain coordinates during a certain time interval. We calculated the traffic intensities of ships around Europe. An important prerequisite is that all grid elements should have the same size, so the grid was defined as areas of 10,000 square kilometres. Different coordinate systems were compared, with the WGS 1984 system being selected. To draw the data on a map, the Lambert azimuthal equal-area projection was used. This projection preserves surface area under the transformation. The final visualization is done in Shiny in combination with Leaflet.
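The equal-area gridding described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses the spherical form of the Lambert azimuthal equal-area projection (a real workflow would normally use an ellipsoidal implementation from a GIS library), assumes a projection centre of 52°N, 10°E, takes 100 km by 100 km cells to obtain 10,000 square kilometres, and assumes the positions have already been filtered to the time interval of interest. All names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # spherical approximation (assumption)
CELL_KM = 100.0            # 100 km x 100 km = 10,000 square kilometres
LAT0 = math.radians(52.0)  # assumed projection centre
LON0 = math.radians(10.0)

def laea_km(lat_deg, lon_deg):
    """Spherical Lambert azimuthal equal-area forward projection, in km."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    dlon = lon - LON0
    denom = (1.0 + math.sin(LAT0) * math.sin(lat)
             + math.cos(LAT0) * math.cos(lat) * math.cos(dlon))
    k = math.sqrt(2.0 / denom)
    x = EARTH_RADIUS_KM * k * math.cos(lat) * math.sin(dlon)
    y = EARTH_RADIUS_KM * k * (math.cos(LAT0) * math.sin(lat)
                               - math.sin(LAT0) * math.cos(lat) * math.cos(dlon))
    return x, y

def cell_of(lat_deg, lon_deg):
    """Index (column, row) of the equal-area grid cell containing a position."""
    x, y = laea_km(lat_deg, lon_deg)
    return math.floor(x / CELL_KM), math.floor(y / CELL_KM)

def traffic_counts(positions):
    """Count distinct ships per cell; positions is an iterable of (ship_id, lat, lon)."""
    ships_per_cell = {}
    for ship_id, lat, lon in positions:
        ships_per_cell.setdefault(cell_of(lat, lon), set()).add(ship_id)
    return {cell: len(ids) for cell, ids in ships_per_cell.items()}

# Made-up positions: ship "A" reports twice in the same cell and is counted once.
sample = [("A", 52.1, 10.1), ("A", 52.2, 10.2), ("B", 52.1, 10.1), ("C", 60.0, 10.0)]
counts = traffic_counts(sample)
print(counts)
```

Because the projection preserves area, every cell of the resulting grid covers the same surface, so the counts are comparable between northern and southern waters.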
Jacek Maślankowski
e-mail: j.maslankowski@stat.gov.pl
Title: <<< Big Data quality issues regarding multi-domain statistical data combining – a survey and case studies >>>
Employing Big Data methods and tools to produce statistical data necessitates a revision of the data quality framework for official statistics. Although many different efforts have been made, including the UNECE Big Data Quality Framework and various approaches in research papers, there is no unified Big Data quality framework that can be applied to different types of data sets, such as social media or large structured data sets. On the other hand, the variety of Big Data quality frameworks makes it possible to create a set of quality indicators that assess different aspects of data source usability. Therefore, the solution is to create different frameworks depending on the data set used. This is rather easy when a single data set is used. It is more complicated when different data sets are integrated, including various data types. The aim of the paper is to show how Big Data quality frameworks can be applied to create a set of indicators for assessing data set quality at three different stages: as input data sources, during the processing phase (data source integration) and when producing the output data (final experimental tables). The paper covers different aspects of Big Data integration. The first is intra-domain integration, combining data sets within three different statistical domains: population, tourism and agriculture. It includes combining data from traditional surveys and Big Data sources, such as social media data or satellite data. The second aspect is inter-domain integration of data sources. We have tested data integration by combining population and tourism data sets. The case studies and pilot surveys allow us to draw original conclusions on how to measure data quality and which quality indicators can be applied to provide a reliable assessment of the data sources and results.
Michał Bis/Anna Bilska/Eleni Bisioti
e-mail: M.Bis@stat.gov.pl/A.Bilska@stat.gov.pl/e.bisioti@statistics.gr
Title: <<< Coverage of AIS data: comparison of privately held to national datasets - Poland and Hellenic experiences >>>
AIS data, real-time measurement data of ship positions, is one of the potential Big Data sources investigated by the ESSnet Big Data project. The main aim of the relevant work package (WP4) is to explore the potential use of AIS data in the production of official statistics, given their advantage of being generic worldwide and obtainable at European level. Five National Statistical Institutes participate in WP4: those of the Netherlands (work package leader), Denmark, Greece, Norway and Poland. This paper outlines the results of the comparison of national public AIS data to privately held AIS data, in terms of quality and metadata. Various Big Data technologies were used to store, manage and process the huge volumes of AIS data, such as the Hadoop Distributed File System, the large-scale data processing engine Apache Spark, the Scala language, and the NoSQL database Elasticsearch using GeoData. Moreover, the exploration and visualization tool Kibana was used. The main conclusion drawn from this paper is that AIS is a big data source with great potential to improve official statistics; however, more work is needed to ensure the transparency and soundness of methods and processes before the privately held data can be incorporated into statistical production.
David Salgado
e-mail: david.salgado.fernandez@ine.es
Title: <<< Estimation of population counts combining official data and aggregated mobile phone data >>>
Beyond doubt, mobile phone data stand as one of the most promising Big Data sources for the production of official statistics. Accordingly, in the recent ESSnet on Big Data, in which 22 partners of the European Statistical System (ESS) participated, a work package was completely devoted to the access to these data, the development of statistical methodology, and the analysis of IT tools and of quality issues, so that this promising information source can become a regular resource in the production of official statistics. We offer a summary of the work conducted in this work package, ranging from the intricate issue of accessing diverse forms of mobile phone data (microdata/aggregated data), through setting up an inferential framework to use aggregated mobile phone data in combination with official data to produce population counts, to the development of some IT tools for providing a proof of concept and first analytical results on real data. All these enter as relevant factors in the quality assessment of the final estimates. As explained in the results of the ESSnet, although we have been able to collect enough real data to conduct the analytical study, access to mobile phone data is still an open question which needs further work within the ESS and the European Union. A first set of conclusions and guidelines for partners of the ESS has been obtained. Regarding the statistical methodology, since traditional survey sampling techniques cannot be used, we have explored the use of hierarchical statistical models, as in ecological sampling, to propose a generic inferential framework for the counts of diverse target populations (commuters, resident tourists, inbound tourists, general population, ...). The analysis is completed by providing software tools to implement this methodological proposal, showing a proof of concept with both simulated and real data, and assessing the quality of the final estimates.
