Session 24

Title of session: Data integration methods and architectures

Chair: Włodzimierz Okrasa

Room: S4A Mariacki

Time: 10:30 - 12:00

Date: 29 June

Session 24 - papers & presentations

Presenting Author / Abstract
Daan Zult
Title: <<< Improvement of the capture-recapture model to estimate under coverage of administrative data sources >>>
The under coverage of an administrative data source is the true population size minus the observed population size. The true population size is often estimated with the capture-recapture (CRC) model. An important requirement for the CRC model to hold is that specimens can be identified across captures, so that the observer knows whether two records refer to the same specimen (i.e. there are no linkage errors). This is particularly relevant when identification is obtained not from a unique identifier (like a tag or id-code) but from background information (like name, address or skin patterns). Linkage based on background information is usually probabilistic and can lead to different specimens being falsely linked or to the same specimen being missed as a link, which gives biased population estimates. A partial solution to this problem was provided by Ding and Fienberg (1994) and was later extended by Di Consiglio and Tuoto (2015). These authors show how to use linkage probabilities to correct the CRC estimator. De Wolf et al. (in progress) further extend this model and show that all these models are special cases of a more general model. However, all these correction methods are designed for only two registers and make no use of background variables (such as sex or age), which implies that correcting for linkage errors cannot be combined with the other corrections that are often required to address other sources of bias. In this paper we reformulate the general model of de Wolf et al. into a Weighted-CRC (W-CRC) model. W-CRC can deal with multiple captures and background characteristics. We further show how the W-CRC model works in practice.
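As a rough illustration of the setting described in this abstract, the sketch below uses the classical two-source (Lincoln-Petersen) capture-recapture estimator, not the W-CRC model itself, whose details are not given here. The function names and all counts are hypothetical. It shows how under coverage follows from the estimated population size, and how missed links push the estimate upwards.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical two-source capture-recapture estimate of population size.

    n1, n2: number of records in source 1 and source 2
    m:      number of records linked between the two sources
    """
    if m <= 0:
        raise ValueError("at least one linked record is required")
    return n1 * n2 / m


def under_coverage(n_hat: float, n_observed: int) -> float:
    """Estimated true population size minus the observed population size."""
    return n_hat - n_observed


# Hypothetical counts, for illustration only.
n1, n2 = 900, 800
true_links = 720

n_hat = lincoln_petersen(n1, n2, true_links)
coverage_gap = under_coverage(n_hat, n1)  # under coverage of source 1

# Assume probabilistic linkage misses 5% of the true links (an assumption).
# Fewer observed links means a larger, upward-biased population estimate.
observed_links = 684
n_hat_biased = lincoln_petersen(n1, n2, observed_links)

print(f"estimate with perfect linkage: {n_hat:.1f}")
print(f"under coverage of source 1:    {coverage_gap:.1f}")
print(f"estimate with missed links:    {n_hat_biased:.1f}")
```

False links have the opposite effect: they inflate the number of linked records and bias the estimate downwards, which is why the correction methods cited above rely on linkage probabilities.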
Antonio Laureti Palma
Title: <<< Change is the law of life. Prepare your data warehouse for a Big Future, by including Big Data >>>
National Statistical Institutes (NSIs) produce statistics based on consolidated statistical models according to the study-domain characteristics. Conversely, Big Data (BD) opportunities now force NSIs to deliver statistical products based on a sequence of data analytic processing steps. This means that algorithm experts may work without theoretical knowledge of the subject matter; i.e. in data analytics, knowledge extraction is data driven. From this perspective, multidisciplinary approaches within NSIs are crucial, because NSIs must be able to combine different areas of expertise to map analytical constructs onto theoretical concepts. This is an epistemological process change that forces NSIs to meet some basic requirements at the procedural, organizational and infrastructural levels. At the process level, the use of BD introduces a paradigm shift, typical of data mining, from theory-driven to data-driven models. At the organizational level, it is necessary to guarantee and support the active participation of multidisciplinary experts in the process of knowledge extraction. At the infrastructural level, new and complex infrastructures are needed to support both analytical tools and multidimensional analysis. Typically, in an NSI the central repository of integrated data is the corporate Statistical Data Warehouse (S-DWH); this stores current and historical data in one single place and is used by knowledge workers. To manage the opportunities arising from BD, the S-DWH must address the needs of new data types, new volumes, new data-quality levels, new performance, new metadata, and new user requirements. The S-DWH will be data-driven and extremely flexible and scalable. In this work we present some S-DWH architectures that integrate typical prospective BD sources of statistical data: sensor data, web-scraped data and mobile phone data.
All the solutions presented aim to facilitate a multidisciplinary approach to improve statistical process quality.
Sandra Barragán Andrés
Title: <<< Combining information from different sources to estimate the annual working hours of part-time workers >>>
Structural statistics on earnings provide comparable information on relationships between the level of earnings, individual characteristics of employees (sex, age, occupation, length of service) and their employer (economic activity, size of the enterprise). In Spain there is a main survey carried out every four years. Additionally, thanks to administrative records, most of the information can be provided annually as well. However, some problems arise when different administrative records are used together and joined with survey results to obtain the final dataset. Administrative data were originally collected for a specific non-statistical purpose, which might affect the quality of the data. In addition, there is a coherence issue when more than one source of information is integrated to obtain data for the same variable. Coherence is an essential part of official statistics, so it is receiving growing attention as a way to ensure data quality. Statistical offices focus on the importance of obtaining coherent results, especially when they come from a process of data integration. Throughout this process, we deal with the inconsistency of the individual values by taking into consideration the additional uncertainty due to the difference between the administrative concept and the statistical variable. We present the methodology used in the latest publication of the structural earnings survey to obtain the annual working hours of part-time workers. We have combined three sources of information in order to obtain a trustworthy value of the working hours for each individual. We have developed an algorithm with decision rules built using expertise in the field. This algorithm is an initial solution to maintain coherence at microdata level when combining different sources for the same variable.
Tuukka Saranpää
Title: <<< Data Architecture for Statistics Production – Logical Data Repositories >>>
Traditionally, data repositories at Statistics Finland have been designed for the needs of one or a few specific statistical areas. Additionally, development projects on data repositories have typically assumed a systems-oriented approach, focusing on the technological solutions required. Over time, the lack of top-down coordination has resulted in overlap, inconsistency, poor transparency and impaired quality across different data repositories. These issues, along with new national guidelines for Finnish public administration, prompted Statistics Finland to organize a preliminary study to outline the fundamental requirements for a common statistical data architecture as well as a roadmap for its development. In January 2017, following the findings of the preliminary study, Statistics Finland launched a development program on data architecture for statistical production to coordinate the various projects outlined in the roadmap. In order to ensure consistency and interoperability between different development projects, a top-level to-be model for logical data repositories for statistics production was required. In developing the model, Statistics Finland assumed a holistic top-down view on statistical production, shifting the focus from statistics to data. The model envelops the entire statistics production process (Generic Statistical Business Process Model phases 4-7), from data acquisition to dissemination, and determines the names and general constitution of all relevant top-level data repositories. Some key observations on the new model are the emphasized role of metadata throughout the process and a centralized top-level repository for geospatial data serving all areas of statistics production. The newly developed model provides a top-level framework for any development effort on a specific area.
However, to truly capture the benefits of a common data architecture, significant bottom-up development work is required to determine the exact content and physical infrastructure of each repository. Furthermore, the top-level model may be improved based on observations and experiences from lower-level development.
Piero Demetrio Falorsi
Title: <<< Modernizing Data Integration Systems at Istat >>>
Istat has embarked on a modernization programme that includes a significant revision of statistical production. The principal concept underlying this change is the use of a system of integrated statistical registers as a base for all the production surveys; this system is referred to below as the Italian Integrated System of Statistical Registers (ISSR). The ongoing work of building such a system has required a large investment in architectural aspects to guide the enterprise-level design of the system. In this respect, two major activities have been undertaken: (i) the design of the data architecture of the ISSR and (ii) the design of the processes to populate the ISSR. The data architecture of the ISSR has been conceived according to modern semantic integration approaches, with a strong emphasis on active metadata guiding access to the system. The design of the processes has relied on existing standards, like GSBPM and GAMSO, and frameworks, like the ESS Enterprise Architecture Reference Framework. In this paper, we describe both strands of work, focusing on how quality aspects have been taken into account and specifically on the role played by both official statistics and world-wide technical standards in ensuring the quality of data, processes and underlying information systems.
