Time slot | Topic |
09:30 – 09:45 | Introduction and objectives of the course. Presentation of the lecturers and participants. |
Morning session | |
09:45 – 11:00 | Web scraping – history, tools, types of web scraping. Acquiring data from the web – manual vs. automatic tools. Quality issues in web scraping – sustainability, coverage and representativeness. Examples and exercises. |
11:00 – 11:30 | Coffee Break |
11:30 – 13:00 | Web scraping semi-structured data. Combining two different web data sources – de-duplication issues. Examples and exercises. |
13:00 – 14:00 | Lunch Break |
Afternoon session | |
14:00 – 15:30 | Machine learning fundamentals – supervised vs. unsupervised learning. Examples of text and numeric data. Text mining – processing high-quality text data for machine learning. Examples and exercises. |
15:30 – 16:00 | Coffee Break |
16:00 – 17:30 | Machine learning with web data – how to prepare a good training dataset. Quality aspects of machine learning. Examples and exercises. |
Description and objectives of the course:
Objectives
The course shows the fundamentals of using Big Data in official statistics through three aspects of Big Data: web scraping, text mining and machine learning. It is based on practical examples and exercises that help participants better understand how Big Data is used in official statistics. The examples are drawn from real issues of data gathering and processing for official statistics.
Participants’ profile
Participants should have basic knowledge of the concept of Big Data. No programming skills are required: all examples in Python will be worked through with the instructor. Participants should be familiar with the basic use of IT tools.
Overall description and approach
The workshop will be based on examples: participants will do exercises and run examples in Python. The fundamentals of web scraping and machine learning will be presented with practical examples for use in official statistics. The goal is to identify the data quality risks that web scraping and machine learning carry when used in official statistics.
Facilitators/Instructors (short biographical note)
Jacek Maślankowski, Ph.D., is a researcher and academic teacher at the Department of Business Informatics, University of Gdańsk, and a consultant at the Statistical Office in Gdańsk (Statistics Poland). His research concentrates mostly on Big Data and Data Warehousing with Business Intelligence. He is the author of numerous publications on Big Data appliances. He has been involved in several Big Data projects in official statistics, including a contribution to the Big Data Quality Framework by UNECE. Currently he is a delegate to the ESSNet Big Data work packages: WP2 – Web Scraping Enterprise Characteristics (member), WP7 – Multi Domain (leader of methodology, consultant) and WP8 – Methodology (member, internal coordinator of IT Report). He is the author or co-author of statistical software, including Big Data Social Media Presence and Life Satisfaction.