Chair: Kari Djerf
Room: S4B Lajkonik
Time: 10:30 - 12:00
Date: 29 June
|Title: <<< Register-based estimation of total dwellings and households >>>
The total numbers of dwellings and households are key statistics of a country. Traditionally these figures are produced from a population and housing census, which can be costly. How to produce these statistics from statistical data originating from the relevant administrative sources is currently a major challenge for the Census Transformation Programmes in a number of European countries. We study the matter based on the Norwegian Address Register and Population Register. A particular difficulty arises because dwelling identification at multi-dwelling addresses is problematic, owing to quality issues in the input sources, despite reliable identification of all the addresses. We develop two extensions to the existing capture-recapture methodology in this context. The first can be characterised as a two-step approach: one first applies standard capture-recapture methods to obtain an estimate of the number of resident addresses (i.e. addresses at which there exist dwellings and possibly households), and then uses various missing-data methods to estimate the number of dwellings and households per address. The second approach can be formulated in terms of a log-linear model, under which it becomes possible to estimate at once the sizes of three populations, namely resident addresses, dwellings and resident dwellings (i.e. dwelling households). We demonstrate both approaches using real-life data. They provide potential options for purely register-based estimation of total dwellings and households, instead of the costly census.
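As an illustration of the "standard capture-recapture" first step only, the following minimal sketch applies the textbook two-list Lincoln-Petersen estimator and Chapman's small-sample correction to two address lists. The list names and the toy address identifiers are hypothetical, and the sketch does not reproduce the extensions developed in the talk.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list estimate of population size: n_a * n_b / m."""
    a, b = set(list_a), set(list_b)
    matched = len(a & b)  # units found in both lists
    if matched == 0:
        raise ValueError("no overlap between the lists; estimator undefined")
    return len(a) * len(b) / matched


def chapman(list_a, list_b):
    """Chapman's variant, which stays finite when the overlap is small."""
    a, b = set(list_a), set(list_b)
    matched = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (matched + 1) - 1


# Hypothetical toy data: address ids seen as resident in each register.
register_a = range(0, 80)    # 80 addresses
register_b = range(40, 120)  # 80 addresses, 40 of them shared with register_a

print(lincoln_petersen(register_a, register_b))
print(chapman(register_a, register_b))
```

Both estimators assume, among other things, that the two lists are independent and that records can be linked without error.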
|Title: <<< Applications of multiple tests to improve the quality of multi-source data sets. >>>
In official statistics we often use multi-source information such as desk research, administrative sources, censuses, reporting and others. It is necessary to assess the quality of this kind of information. This assessment could be based on deciding whether these data sets can be integrated. In our considerations, treating the different data sets as samples, we present the application of multiple testing to assessing the quality of multi-source data sets. Statistical experience suggests that stepwise multiple test procedures are suitable. We use the more powerful step-up and step-down procedures, which control the error rates in multiple inference.
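The abstract does not name the procedures used. As a sketch of what step-down and step-up testing look like, the code below implements two well-known examples, Holm (step-down) and Hochberg (step-up), on hypothetical p-values; both use the same thresholds alpha / (m - rank) but scan the ordered p-values in opposite directions.

```python
def holm_step_down(p_values, alpha=0.05):
    """Holm: start at the smallest p-value, stop at the first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


def hochberg_step_up(p_values, alpha=0.05):
    """Hochberg: start at the largest p-value; the first one that passes its
    threshold is rejected together with every smaller p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank in range(m - 1, -1, -1):
        if p_values[order[rank]] <= alpha / (m - rank):
            for r in range(rank + 1):
                reject[order[r]] = True
            break
    return reject


# Hypothetical p-values from four pairwise comparisons of data sources.
p = [0.01, 0.04, 0.03, 0.045]
print(holm_step_down(p))
print(hochberg_step_up(p))
```

On these values Holm rejects only the first hypothesis, while Hochberg rejects all four. Holm controls the familywise error rate under arbitrary dependence; Hochberg requires independence or certain forms of positive dependence.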
|Title: <<< Exploiting Auxiliary Data: Random forest regression estimator >>>
Suppose we have, in addition to our survey data, some auxiliary data A that are related to our target variables, but we don't know how. The source of the data could be, for example, administrative data or even Big Data, and it covers an important subset of the target population. An important question arises: how can we exploit these data to improve our estimates? The increase in accuracy would allow a reduction in the sample size of the survey. We propose a new regression estimator based on random forests: the Random Forest Regression Estimator. It is similar to the GREG estimator, with the same advantages that random forest regression has over linear regression. In particular, the relationship between A and the target variables is learnt directly from the data, and both discrete and continuous explanatory variables can be used directly. This also means that the method easily accommodates a change in the structure of the auxiliary data. The estimator is nearly unbiased and we give an approximate estimator of its variance for stratified random sampling. It is important to note that the estimator remains unbiased even if the model is poor. Several simulations are run, with both synthetic and real data, to show its performance in practice. Finally, several different applications to official statistics are proposed. Aside from sample size reduction, if random forests are used for imputation, the estimator can be used to correct the bias of aggregate means or totals, which constitutes its most interesting secondary use.
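The abstract does not give the estimator's formula. The sketch below shows only the generic GREG-type (model-assisted) form that such an estimator is usually built on: predictions summed over the units with auxiliary data, plus a design-weighted sum of sample residuals. The function name, the `predict` callable (standing in for a fitted random forest) and the toy numbers are all assumptions made for illustration.

```python
def model_assisted_total(population_x, sample, predict):
    """Model-assisted estimate of a population total.

    population_x: auxiliary values for every unit in the population
    sample:       (x, y, design_weight) tuples for the surveyed units
    predict:      any fitted model mapping x to a predicted y
    """
    synthetic = sum(predict(x) for x in population_x)
    correction = sum(w * (y - predict(x)) for x, y, w in sample)
    return synthetic + correction


# Hypothetical toy data: 6 population units, 3 sampled with weight 2 each.
population_x = [1, 2, 3, 4, 5, 6]
sample = [(1, 3, 2), (3, 5, 2), (6, 13, 2)]


def good_model(x):
    """Stand-in for a well-fitting predictor."""
    return 2 * x


def poor_model(x):
    """Stand-in for a useless predictor that always returns 0."""
    return 0


print(model_assisted_total(population_x, sample, good_model))
print(model_assisted_total(population_x, sample, poor_model))
```

With the useless predictor the estimate falls back to the plain weighted sample total, which illustrates why a residual correction of this kind protects against a poor model.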
|Claudia De Vitiis
|Title: <<< Assessing and adjusting bias deriving from mode effect in mixed mode social surveys >>>
The mixed mode (MM), i.e. the use of different collection techniques in one survey, is a relatively new approach for ISTAT, especially for social surveys. It is adopted both to counter declining response and coverage rates and to reduce the cost of the surveys. Nevertheless, mixed mode introduces several issues that must be addressed both at the design phase, by defining the best collection instruments to contain the measurement error, and at the estimation phase, by assessing and treating the bias effects (mode effect) due to the use of MM, in order to ensure the accuracy of the estimates. Mode effect refers strictly to measurement error differences due to the mode of survey administration, but when modes are not assigned randomly a selection effect can generally occur, and appropriate inference methods are needed to evaluate the mode effect because the two types of error are confounded. Disentangling selection and measurement effects requires auxiliary information that is assumed to be mode insensitive, acquired from registers or collected by the survey itself. The problem of the selection effect can be addressed with adjustments based on the Propensity Score (PS) approach, which make it possible to mitigate the confounding effects of the selection mechanism and to evaluate the measurement error correctly within homogeneous groups of units. The focus of this work is the experience in the evaluation and treatment of the MM effect in the experimental setting of the ISTAT “Aspects of Daily Life” survey 2017, a sequential CAWI/PAPI survey for which a single-mode PAPI control sample was planned in order to assess the mode effect on two independent samples with different techniques. Methods to assess the impact of MM on the quality of the estimates and the representativeness of the two samples, and models to evaluate the measurement error and selection effect in the MM sample, are tested.
|Title: <<< Estimation of the standard error for net changes with the EU Labour Force Survey – How can users independently and appropriately calculate standard errors and confidence intervals? >>>
The EU Labour Force Survey (EU-LFS) is one of the most important official surveys for comparative social research in Europe. As such it is a source for the estimation of indicators for monitoring economic and social policy. To assess whether observed changes in indicators are significant, variance estimation for the estimated changes is required. This task is challenging as most countries use complex sampling designs and different rotation schemes. Due to their partial overlap between waves, rotational panels allow a more efficient estimation of changes. To account for this, the covariance of cross-sectional estimates has to be estimated. In practice, users can face some difficulties in doing so because time-consistent identifiers are required. However, the data for scientific purposes released by Eurostat currently contain identifiers for the primary sampling units that are randomized per dataset and are only consistent for one year, supporting the erroneous assumption of statistical independence between waves. Taking the example of LFS data from Austria, we show how the available design information can be used to estimate the variance of change in cross-sectional indicators. For this we use the method proposed by Berger & Priam (2016), which represents a solution to the variance estimation problem in the presence of incomplete sampling design information. Statistics Austria releases files for the Austrian microcensus which can also be used for the Austrian LFS to overcome the restrictions caused by longitudinally inconsistent identifiers. This enables an empirical examination of the error that can occur as a consequence of the erroneous assumption of independent samples. Thereby we can show how proper variance estimation is feasible. Therefore, we recommend that variables for stratum, clustering, weight, and completely time-consistent unit identifiers should be released if there are no confidentiality concerns. This would considerably improve variance estimation based on anonymised microdata.
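To see why the covariance between waves matters, the sketch below uses only the elementary identity Var(b - a) = Var(a) + Var(b) - 2 Cov(a, b). It is not the Berger & Priam method, which concerns how that covariance is estimated from the design information; the standard errors and the correlation used here are hypothetical inputs.

```python
import math


def se_of_change(se_wave1, se_wave2, correlation=0.0):
    """Standard error of the difference between two cross-sectional estimates."""
    covariance = correlation * se_wave1 * se_wave2
    variance = se_wave1 ** 2 + se_wave2 ** 2 - 2 * covariance
    return math.sqrt(variance)


def confidence_interval(change, se, z=1.96):
    """Normal-approximation interval (z = 1.96 gives roughly 95 %)."""
    return (change - z * se, change + z * se)


# Hypothetical indicator that rose by 0.8 points between two waves.
se_independent = se_of_change(0.3, 0.4)                 # waves treated as independent
se_overlap = se_of_change(0.3, 0.4, correlation=0.5)    # overlapping rotational panel

print(se_independent, confidence_interval(0.8, se_independent))
print(se_overlap, confidence_interval(0.8, se_overlap))
```

With a positive correlation between waves, the independence assumption overstates the standard error of the change, so a real change may wrongly appear insignificant.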
|Title: <<< Adjusting the gender pay gap using the Structure of Earnings Survey data >>>
Reducing the gender pay gap (GPG) is one of the key priorities of gender policies at the EU and national levels. At the EU level, the European Commission prioritised "reducing the gender pay, earnings and pension gaps and thus fighting poverty among women" as one of the key areas in its "Strategic engagement for gender equality 2016-2019". The unadjusted GPG, calculated as the relative difference between the average earnings of women versus men, is widely used in this context as the key indicator to monitor and evaluate progress in this area. However, the unadjusted GPG entangles in its measurement both possible discrimination between men and women, in terms of "unequal pay for equal work", as well as the impact of differences in the average characteristics of men and women in the labour market. Against this backdrop, Eurostat has developed a methodology to adjust the GPG using the Structure of Earnings Survey (SES) microdata. The methodology is based on the Blinder-Oaxaca decomposition. The SES microdata provide information on the earnings of individual employees as well as on some personal, job and enterprise characteristics. Eurostat's project provides a decomposition of the difference between male and female earnings into explained and unexplained parts. The explained part is the gap between male and female earnings which is due to the differences in the average characteristics (sector of activity, age, occupation, etc.) of male and female employees. The unexplained part measures the difference between the financial returns to men and women with the same characteristics. Eurostat's methodology and results should help policy makers to better interpret the unadjusted GPG. They should also stimulate further discussion within the European Statistical System on a common method to adjust the GPG indicator.
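A minimal sketch of a two-fold Blinder-Oaxaca decomposition with a single explanatory variable follows. It assumes the male coefficients as the reference structure and works on an absolute difference in mean earnings; the abstract does not state Eurostat's choices on either point, and the data are hypothetical.

```python
from statistics import fmean


def ols_line(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def oaxaca_twofold(x_m, y_m, x_f, y_f):
    """Split mean(y_m) - mean(y_f) into explained and unexplained parts."""
    a_m, b_m = ols_line(x_m, y_m)
    a_f, b_f = ols_line(x_f, y_f)
    mean_x_m, mean_x_f = fmean(x_m), fmean(x_f)
    # Explained: difference in average characteristics, valued at male returns.
    explained = (mean_x_m - mean_x_f) * b_m
    # Unexplained: difference in returns, evaluated at female characteristics.
    unexplained = (a_m - a_f) + mean_x_f * (b_m - b_f)
    return explained, unexplained


# Hypothetical data: x could be years of experience, y hourly earnings.
x_men, y_men = [1, 2, 3, 4], [12, 14, 16, 18]
x_women, y_women = [1, 2, 3], [10, 12, 14]

explained, unexplained = oaxaca_twofold(x_men, y_men, x_women, y_women)
gap = fmean(y_men) - fmean(y_women)
print(gap, explained, unexplained)
```

Because each fitted line passes through its group's means, the two parts always add up to the raw gap; in practice the regressions would include many characteristics (sector, age, occupation, etc.).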