Introduction to Machine Learning for Policy Analysis (Maastricht 2024)
Welcome
Dear students, welcome to the course repository, where you will find all information supplementing this term’s machine learning for policy analysis course. Here you will find the lectures on the two topics introduced (Supervised Machine Learning & Natural Language Processing) in video format, plus accompanying R Markdown notebooks.
To get the most out of these lectures, I expect you to have R & RStudio installed and updated on your local machine, and to be generally used to doing data analytics in R using the tidyverse ecosystem. If that is not the case, you might want to take a look at the additional resources such as "My R Brush-up course (Bonus)" below, where I recap the fundamentals of working with data in R.
Watch this intro video to get started
Lecturer (briefly about me)
Daniel is a Strategic Business Manager at Novo Nordisk, where his team develops data-driven methods and workflows to improve the performance of clinical trials. He is also an Associate Professor in Data Science & Innovation Economics at the Aalborg University Business School, where he led the Data Science research track at the AI:Growth lab and coordinated teaching at the Social Data Science (SDS) master specialization. His research is dedicated to the development and application of data-driven methods to map, understand, and predict technological change, and its causes and consequences for socioeconomic systems on various levels of aggregation. His current contextual focus is the dynamics of AI research and industry.
His research is featured in leading academic journals such as Research Policy, has also attracted attention and funding from industry, and has led to prize-winning applications. Daniel is actively engaged in initiatives to educate (social science) students and researchers, professionals, and policymakers in understanding, evaluating, and applying modern Data Science and Artificial Intelligence methods for data-driven decision making.
As part of the AI:DK project, he coordinates and leads AI proof-of-concept projects within industry. His team also develops enterprise and policy software solutions for IP search and technology mapping.
Lectures
Legend:
- T: Theory lecture, explaining concepts without using too much code
- A: Applications and demonstrations of concepts and techniques, mostly code-based
- E: Exercises for you to try your skills
Introduction to Supervised Machine Learning (S-ML) in R
This part will introduce you to the fundamentals of supervised machine learning (SML, also known as predictive modelling), and illustrate practical applications thereof in R.
Introduction to Natural-Language-Processing (NLP) in R
In this part you will be introduced to the fundamentals of analysing textual data, and the practical application in R. After reviewing the basics of string manipulation, we will cover bag-of-words-style text summaries, and then move on to slightly more advanced applications such as sentiment analysis and topic modelling.
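To give a first impression of what a bag-of-words summary looks like, here is a minimal sketch in base R. It is not taken from the course notebooks: the two example sentences, the helper `tokenise()`, and the object names are made up for illustration, and the lectures themselves may use dedicated text-mining packages instead.

```r
# Toy corpus (invented for illustration only)
docs <- c(
  doc1 = "Policy analysis needs data, and data needs analysis.",
  doc2 = "Machine learning supports policy decisions."
)

# Hypothetical helper: lower-case, strip punctuation, split on whitespace
tokenise <- function(text) {
  clean <- gsub("[[:punct:]]", "", tolower(text))
  strsplit(clean, "\\s+")[[1]]
}

# One character vector of tokens per document
tokens <- lapply(docs, tokenise)

# Shared vocabulary across all documents
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: rows are documents, columns are terms,
# cells are raw term counts (word order is discarded)
bow <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

print(bow)
```

Each row of `bow` describes one document purely by how often each term occurs, which is the representation that many simple sentiment and topic models start from.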
Further Resources
Find below a list of further resources (including my own material), either to brush up on basic R knowledge, supplement what you learn here, or dive deeper into related or advanced topics.
Own research: Technology forecasting with ML & NLP
- Hain, D. S., Jurowetzki, R., Squicciarini, M., & Xu, L. (2023). Unveiling the neurotechnology landscape: scientific advancements, innovations and major trends.
- Nechaev, I., & Hain, D. S. (2023). Social impacts reflected in CSR reports: Method of extraction and link to firms innovation capacity. Journal of Cleaner Production, 429, 139256.
- Hain, D. S., Jurowetzki, R., Buchmann, T., & Wolf, P. (2022). A text-embedding-based approach to measuring patent-to-patent technological similarity. Technological Forecasting and Social Change, 177, 121559.: Own paper, where we introduce text embeddings and use them to map technology based on patent data.
- Bekamiri, H., Hain, D. S., & Jurowetzki, R. (2021). PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT. arXiv preprint arXiv:2103.11933.: More advanced version of the use of embeddings on patents.
- Hain, D. S., Jurowetzki, R., Konda, P., & Oehler, L. (2020). From catching up to industrial leadership: towards an integrated market-technology perspective. An application of semantic patent-to-patent similarity in the wind and EV sector. Industrial and Corporate Change, 29(5), 1233-1255.: Application of the technique.
Data Science in R in general
- Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media, Inc.: The bible of modern data science in R. Use this to get started.
- Baumer, B., Kaplan, D. & Horton, N. (2020) Modern Data Science with R (2nd Ed.). CRC Press: A nice supplementary book, which also touches upon topics such as simulation and network analysis.
- Ismay & Kim (2020), Statistical Inference via Data Science: A ModernDive into R and the Tidyverse, CRC Press.: For those who want to first update their knowledge in basic and inferential statistics in a modern R setup.
Supervised Machine Learning
- Hain, D., & Jurowetzki, R. (2020). Introduction to Rare-Event Predictive Modeling for Inferential Statisticians–A Hands-On Application in the Prediction of Breakthrough Patents. arXiv preprint arXiv:2003.13441.: One of our introductory papers. A slightly more elaborate version of what we did so far, on a more exciting dataset.
- Kuhn, M., Silge, J. (2020). Tidy Modeling with R: Great introduction to tidymodels by the makers.
- Kuhn, M. & Johnson (2019), Feature Engineering and Selection: A Practical Approach for Predictive Models, Taylor & Francis.: Less code but many deep insights into the details of modern ML, by Max Kuhn, the maker of much of tidymodels and caret.
- Silge, Julia (2020). Supervised Machine Learning Case Studies in R. Online course: Great interactive course Julia took out of DataCamp to offer it for free instead. Fully updated to the tidymodels workflow. YOU ALL SHOULD DO IT!
Natural Language Processing
Further topics of (potential) interest
My R Brush-up course (Bonus)
As a bonus, find some very basic introductions to working with data in R (from another course of mine) below. If you are already used to working with R and the tidyverse, there is no need to go through them. But in case you feel your R skills need a bit of a brush-up, feel free to work through the material before attending my classes.