Session 23

Title of session: Approaches for multi-source statistics

Chair: Mátyás Mészáros

Room: S3B Sukiennice

Time: 14:30 - 16:00

Date: 28 June

Session 23 - papers & presentations

Presenting Author | Abstract
Antonella Baldassarini
Title: <<< Enhancing the quality of National Accounts in the estimation of proceeds from illegal markets >>>
National Institutes of Statistics need to assess the size of those illegal activities that make up a substantial part of the illegal economy in order to understand their value and impact on society. At the European level, it has been decided to include illegal activities that produce goods and services in National Accounts. The accounting data provided to the European authorities by each country must include the income generated by the markets for drugs, prostitution and the smuggling of cigarettes and alcohol. Currently, for GDP purposes, several European National Institutes of Statistics estimate the proceeds of illegal drug markets on the demand side (data on the number of consumers and quantities consumed), prostitution on the supply side (data on the number of prostitutes on the market) and cigarette smuggling also on the supply side (data on seizures). In this work, we aim to provide an accurate estimate of the size of these illegal markets. Moreover, we propose a uniform procedure for measuring the flow of illegal proceeds in GDP. The estimation methodology proceeds in steps: first, we estimate the hidden population of workers in the illegal markets for drugs, prostitution and smuggling, to size these markets more precisely. To this end, we exploit administrative data from the Ministry of Justice that record criminal charges brought by Public Prosecutor's offices. For privacy reasons, we have only soft personal information about the criminals. However, we can treat the administrative source as a list of criminals, and this personal information allows us to count how many times a criminal appears in the list. The main question is how many criminals are missed by the justice system but active in the illegal markets, and we provide methodologies to answer this question.
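As an illustration of how repeated appearances in a single list can be turned into an estimate of the unobserved population, here is a minimal sketch. The abstract does not say which estimator the authors use; Chao's lower-bound estimator is swapped in here purely as an assumption, and the identifiers in `charges` are made-up toy data.

```python
from collections import Counter

def chao_lower_bound(appearances):
    """Estimate a hidden population from one list with repeated entries.

    appearances: iterable of (pseudonymised) person identifiers,
    one entry per recorded charge. Hypothetical input format.
    Returns (observed persons, estimated missed persons, estimated total).
    """
    per_person = Counter(appearances)       # number of charges per observed person
    freq = Counter(per_person.values())     # f_k: persons appearing exactly k times
    n_observed = len(per_person)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:
        # Bias-corrected form, used here to avoid dividing by zero
        hidden = f1 * (f1 - 1) / 2
    else:
        hidden = f1 * f1 / (2 * f2)         # Chao's estimate of persons never observed
    return n_observed, hidden, n_observed + hidden

# Toy data (assumption): six persons, ten charges
charges = ["a", "a", "b", "c", "d", "d", "e", "f", "f", "f"]
observed, hidden, total = chao_lower_bound(charges)
print(observed, hidden, total)
```

The estimator relies only on how many persons appear once and twice, which is why counting appearances is enough even when little else is known about each person.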
Dimitris Pavlopoulos
Title: <<< Integration of inconsistent data sources using Hidden Markov Models >>>
Latent class models (LCMs) are increasingly used to estimate and correct for classification error in categorical register and survey data, without the need for a “gold standard”, error-free data source. To accomplish this, LCMs require multiple measures of the same phenomenon within one data collection wave (“latent structure model”), or over time (“hidden Markov model”, HMM), and assume that the errors in these measures are conditionally independent. Unfortunately, this “local independence” assumption is often unrealistic, untestable, and a source of serious bias. However, linking independent sources can solve this problem by making the assumption plausible across sources, while potentially allowing for local dependence within sources. Yet, while this is an attractive method for the production of more accurate official statistics, the procedure is very complex and time-consuming. More specifically, the use of LCMs often requires performing linkage between different data sources and re-estimating the model for each new time period. What is more, data linkage might lead to linkage error and subsequently to biased estimates. In our research we investigate the feasibility of using HMMs specifically in the production of labor mobility estimates in the Netherlands using register and survey data. We do so by first looking at the possibility of parameter re-use and then by analyzing model sensitivity to linkage error. The results suggest that the HMM error estimates are time invariant and, therefore, can be re-used at later time points without the need to re-link the datasets and re-estimate the statistical model. The results also show that linkage error only leads to (substantial) bias in very extreme scenarios and that HMMs, to an extent, can correct for false-positive linkage error. It would be very nice to let this presentation be preceded by Bakker and Zult et al. in the same session. The first paper introduces the topic.
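To make the role of the local independence assumption concrete, the following sketch runs the standard HMM forward recursion with two error-prone indicators (a register and a survey) of one latent state. It is not the authors' model: the two states, every probability and the observation sequence are invented for illustration, and the error probabilities are simply taken as given rather than estimated.

```python
def forward_posteriors(init, trans, emis_by_source, observations):
    """Filtered posteriors P(state_t | observations up to t).

    init[s]                  : P(state_1 = s)
    trans[r][s]              : P(state_t = s | state_{t-1} = r)
    emis_by_source[k][s][o]  : P(source k reports o | true state s)
    observations[t][k]       : category reported by source k at time t,
                               or None when that source is missing
    """
    n_states = len(init)
    prior = list(init)
    posteriors = []
    for obs_t in observations:
        weights = []
        for s in range(n_states):
            w = prior[s]
            # Local independence across sources: multiply per-source likelihoods
            for k, o in enumerate(obs_t):
                if o is not None:
                    w *= emis_by_source[k][s][o]
            weights.append(w)
        total = sum(weights)
        post = [w / total for w in weights]
        posteriors.append(post)
        # Predict the state distribution for the next time point
        prior = [sum(post[r] * trans[r][s] for r in range(n_states))
                 for s in range(n_states)]
    return posteriors

# Hypothetical two-state example (0 and 1 could stand for two contract types)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
register_error = [[0.95, 0.05], [0.10, 0.90]]
survey_error = [[0.90, 0.10], [0.20, 0.80]]
# Third time point: register only, survey not linked
observations = [(0, 0), (0, 1), (1, None)]

posteriors = forward_posteriors(init, trans, [register_error, survey_error], observations)
for t, post in enumerate(posteriors, start=1):
    print(t, [round(p, 4) for p in post])
```

The third time point uses the register alone with the same error matrices, which mirrors the idea of re-using error parameters at later time points without linking the survey again.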
Domenico Fabio Savo
Title: <<< On the Experimental Usage of Ontology-based data access for the Italian Integrated System of Statistical Registers: Quality Issues >>>
Ontology-based data access (OBDA) is a recent paradigm for addressing data management based on a conceptualization of the domain of interest, called an ontology. A system realizing the vision of OBDA consists of three layers: the ontology, which provides a high-level, formal, logic-based representation of the above-mentioned conceptualization; the data source layer, representing the existing data in the various assets of the system; and the mapping between the two layers, which is an explicit representation of the relationship between the data sources and the ontology. Most works on OBDA focus on querying data through the ontology. However, recent papers argue that OBDA is a promising tool for assessing the quality of data, especially in the presence of multiple, possibly mutually incoherent data sources. For example, with the OBDA approach, checking data consistency reduces to verifying whether the data sources contain data contradicting the axioms that constitute the ontology. We have experimented with the above approach in a current project of Istat, namely the Italian Integrated System of Statistical Registers. We have focused on the domain of population data, and we have built an OWL ontology for modeling basic concepts and relationships of this domain, including persons, families, parental relations, citizenship, locations, etc. Then, we have considered a core set of population data and we have specified the mappings between these data sets and the ontology. With such a specification at hand, we have used the MASTRO system for OBDA to carry out several data quality checks, focusing in particular on the consistency dimension. The preliminary results are extremely encouraging, both in terms of the effectiveness of the method and in terms of the efficiency of the checking procedures, in the sense that the performance of the quality check is not affected by the (usually expensive) task of reasoning over the ontology.
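The three layers and the consistency check described above can be illustrated with a toy sketch. It does not use MASTRO or OWL: the axioms, source tables, field names and mappings below are all hypothetical, and the sketch materializes the mapped assertions in memory, which a real OBDA system would typically avoid.

```python
from collections import defaultdict

# Ontology layer (hypothetical axioms)
DISJOINT_CLASSES = [("Person", "Family")]
FUNCTIONAL_PROPERTIES = ["bornIn"]          # at most one value per individual

# Data source layer (hypothetical registers)
sources = {
    "population_register": [{"id": "p1", "birth_place": "Roma"},
                            {"id": "p2", "birth_place": "Milano"}],
    "tax_register": [{"taxpayer": "p1", "born": "Napoli"}],
    "household_register": [{"household": "h1", "member": "p2"}],
}

# Mapping layer: each source row yields class or property assertions
mappings = {
    "population_register": lambda r: [("Person", r["id"]),
                                      ("bornIn", r["id"], r["birth_place"])],
    "tax_register": lambda r: [("Person", r["taxpayer"]),
                               ("bornIn", r["taxpayer"], r["born"])],
    "household_register": lambda r: [("Family", r["household"]),
                                     ("memberOf", r["member"], r["household"])],
}

def apply_mappings(sources, mappings):
    class_assertions = defaultdict(set)      # class -> individuals
    property_assertions = defaultdict(set)   # property -> (subject, value) pairs
    for name, rows in sources.items():
        for row in rows:
            for assertion in mappings[name](row):
                if len(assertion) == 2:
                    class_assertions[assertion[0]].add(assertion[1])
                else:
                    prop, subject, value = assertion
                    property_assertions[prop].add((subject, value))
    return class_assertions, property_assertions

def find_violations(class_assertions, property_assertions):
    violations = []
    for a, b in DISJOINT_CLASSES:
        for individual in sorted(class_assertions[a] & class_assertions[b]):
            violations.append(("disjoint", a, b, individual))
    for prop in FUNCTIONAL_PROPERTIES:
        values = defaultdict(set)
        for subject, value in property_assertions[prop]:
            values[subject].add(value)
        for subject in sorted(values):
            if len(values[subject]) > 1:
                violations.append(("functional", prop, subject, sorted(values[subject])))
    return violations

violations = find_violations(*apply_mappings(sources, mappings))
print(violations)
```

In this toy data the two registers disagree on the birth place of `p1`, so the functionality axiom flags an inconsistency that neither source shows on its own.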
Raffaella Traverso
Title: <<< Data quality and data integration >>>
Data integration allows users to have a unified view of data coming from different data sources. However, data sources can be characterized by several types of heterogeneity, such as different data models, different data representations and different quality at the level of data instances. As the widely accepted definition of Data Quality is “fitness for purpose”, we pose the question: how is the quality of the source data affected when data are combined to satisfy the integration purpose? In the process of integrating data, problems in the quality of data sources become more evident. These are measured by the relevance, accuracy, reliability, timeliness and consistency of the data, as well as by the completeness, validation and relevance of the structure of the metadata attached to the data sources. Through the use of practical examples, we present the challenges posed by data integration and how data sources are thereby affected.
