Training course 3

Theme of the course/workshop: Big Data in Official Statistics

Facilitators/Instructors: Jacek Maślankowski


Time slot


09:30 – 09:45

Introduction and objectives of the course

Presentation of the lecturers and participants

Morning session

09:45 – 11:00

Web scraping – history, tools, types of web scraping

Acquiring data from the web – manual vs. automatic tools

Quality issues in web scraping – sustainability, coverage and representativeness

Examples and exercises

11:00 – 11:30

Coffee Break

11:30 – 13:00


Web scraping semi-structured data

Combining two different web data sources – de-duplication issues

Examples and exercises

13:00 – 14:00

Lunch Break

Afternoon session

14:00 – 15:30

Machine learning fundamentals – supervised vs. unsupervised learning.

Examples of text and numeric data.

Text mining – processing high-quality text data for machine learning.

Examples and exercises.

15:30 – 16:00

Coffee Break

16:00 – 17:30

Machine learning with web data – how to prepare a good training dataset.

Quality aspects of machine learning.

Examples and exercises.

Description and objectives of the course:


The course presents the fundamentals of using Big Data in official statistics through three aspects of Big Data: web scraping, text mining and machine learning. It is based on practical examples and exercises that help participants better understand how Big Data can be used in official statistics. The examples draw on real issues of data gathering and processing for official statistics.

Participants’ profile

Participants should have basic knowledge of the concept of Big Data and be familiar with the basic use of IT tools. No programming skills are required: all Python examples will be worked through together with the instructor.

Overall description and approach

The workshop is example-based: participants will do exercises and run examples in Python. The fundamentals of web scraping and machine learning will be presented with practical examples applicable to official statistics. The goal is to identify the data quality risks that web scraping and machine learning pose when used in official statistics.

Facilitators/Instructors (short biographical note)

Jacek Maślankowski, Ph.D., is a researcher and academic teacher at the Department of Business Informatics, University of Gdańsk, and a consultant at the Statistical Office in Gdańsk (Statistics Poland). His research concentrates mostly on Big Data and on Data Warehousing with Business Intelligence. He is the author of numerous publications on Big Data applications. He has been involved in several Big Data projects in official statistics, including a contribution to the Big Data Quality Framework by UNECE. He is currently a delegate to the ESSNet Big Data work packages: WP2 – Web Scraping Enterprise Characteristics (member), WP7 – Multi Domain (leader of methodology, consultant) and WP8 – Methodology (member, internal coordinator of the IT Report). He is the author or co-author of statistical software, including Big Data Social Media Presence and Life Satisfaction.
