Intro to Text Analysis / NLP (Carpentries)

Workshops, Deep learning, Hugging Face, Text analysis, NLP, LLM, Carpentries, Code-along

Author

Chris Endemann

Published

July 13, 2024

About this resource

The Intro to Text Analysis workshop introduces the field of Natural Language Processing (NLP) and shows how to gain insights from collections of text data (i.e., a corpus). It is a hands-on, step-by-step guide to sourcing and preparing a corpus for analysis, generating text (document/sentence/word) embeddings, performing topic modeling, deploying common models (e.g., Word2Vec and large language models using Hugging Face), and weighing ethical considerations. Students and researchers working with text data, especially in the digital humanities, are encouraged to take this workshop!
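
As a rough illustration of the "prepare a corpus, then represent each document as a vector" idea, here is a minimal sketch that uses only the Python standard library. It is not taken from the workshop materials: the three-sentence corpus and the helper names (`tokenize`, `bag_of_words`, `cosine_similarity`) are assumptions made for illustration, and plain word counts stand in for the learned embeddings (e.g., Word2Vec) that the workshop covers.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
corpus = [
    "The whale surfaced near the ship.",
    "The ship sailed past the whale.",
    "Neural networks learn word embeddings.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(tokens):
    """Count how often each token occurs (a sparse document vector)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Prepare the corpus, then compare documents by their count vectors.
vectors = [bag_of_words(tokenize(doc)) for doc in corpus]
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])

print(f"doc 0 vs doc 1: {sim_related:.2f}")
print(f"doc 0 vs doc 2: {sim_unrelated:.2f}")
```

The two sentences that share vocabulary score higher than the pair with no words in common. Count vectors like these only capture exact word overlap; learned embeddings are meant to also place related but different words close together.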

Prerequisites

Learners are expected to have basic Python programming skills and familiarity with the Pandas package. If you need a refresher, the Introductory Python lesson materials are available for independent study.

Estimated time to complete

This workshop takes approximately 16 hours to complete.

Register to take this workshop in Madison!

The Carpentries is a global organization of researchers who volunteer their time and effort to create workshops that teach software engineering and data analysis skills to other researchers. UW-Madison has its own local Carpentries community, which is actively developing new ML/AI workshops. To be notified of upcoming Carpentries workshops, subscribe to the Data Science @ UW Newsletter.

Alternatively, work through the materials independently!

All Carpentries lessons are published as open source educational materials. You are welcome and encouraged to visit the lesson materials and work through them on your own. If you are involved with a research lab at UW-Madison, you may attend Coding Meetup (Tue/Thur, 2:30-4:30pm) to get help working through the materials.

See also

  • Workshop: Intro to Deep Learning with Keras: Explore deep learning concepts in greater detail. This will help you better understand the technology (neural networks) behind Word2Vec and large language models.
  • Book: Understanding Deep Learning - Simon J.D. Prince: This free textbook is a good modern overview of deep learning (and machine learning in general), and provides Colab notebooks for exploring deep learning concepts and implementations. The book uses PyTorch as its framework of choice. It covers in more depth some topics that the workshop touches on only briefly.