Project Gutenberg: Text & Audio Books
About this resource
The Project Gutenberg dataset contains text from thousands of books, spanning a variety of genres and styles, and in some cases, corresponding audiobooks. Researchers and students working on machine learning applications can use this dataset to explore tasks such as language modeling, text classification, summarization, and speech synthesis. The dataset’s availability in both text and audio formats makes it suitable for multimodal learning tasks as well.
Key features
- Text & audio: Many books in the Gutenberg collection have corresponding audiobooks (through LibriVox), enabling both text-based and audio-based learning tasks.
- Multilingual content: While primarily in English, the dataset includes books in other languages such as French, German, and Spanish, providing opportunities for multilingual and cross-lingual research in NLP.
- Long-form text: The dataset includes full-length novels, short stories, and essays, making it ideal for tasks that require understanding context over longer sequences of text.
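Because full-length books are far longer than the input most models accept at once, long-form text is usually split into overlapping windows before training. The sketch below shows one simple way to do this; the function name `chunk_words` and the default sizes are illustrative assumptions, not part of any Gutenberg tooling, and it counts whitespace-separated words rather than model tokens.

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into overlapping windows of at most `chunk_size` words.

    Hypothetical helper for illustration: sizes are counted in
    whitespace-separated words, not in model-specific tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

Each chunk shares `overlap` words with the previous one, so context that straddles a boundary still appears intact in at least one window.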
Key applications
- Language modeling: With its vast variety of literary styles and genres, Gutenberg serves as a valuable resource for training and evaluating language models like GPT and BERT. Pre-training on Gutenberg’s diverse text corpus allows models to capture nuanced linguistic patterns, which can later be fine-tuned for more specific NLP tasks.
- Text classification: The dataset can be applied to classification tasks such as genre classification or sentiment analysis. Researchers often use Gutenberg to train classifiers that distinguish between literary styles or detect emotional tone in texts.
- Summarization and translation: Due to the diversity in content, Gutenberg is commonly used to test summarization models (e.g., creating concise book summaries) and translation algorithms across different literary forms.
- Topic modeling: The diverse collection of texts allows for the exploration of underlying themes or topics through techniques like Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF), enabling researchers to uncover hidden patterns in the literature.
- Multimodal learning: Paired with LibriVox audiobooks, the Gutenberg dataset enables multimodal tasks like text-to-speech synthesis, speech recognition, and aligning spoken text with its written counterpart. This supports the development of models like Tacotron (speech synthesis) and Wav2Vec (speech recognition).
- Transfer learning: Researchers frequently fine-tune pre-trained language models on Gutenberg to test performance on literary and long-form text, often comparing results with models trained on broader corpora like Common Crawl.
- Data augmentation: Gutenberg’s large-scale text is well suited to augmenting smaller datasets, which can improve model robustness and generalization.
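Topic-modeling techniques such as LDA and NMF, mentioned in the list above, take a document-term matrix as input. As a rough, standard-library-only sketch of that first step, the hypothetical helper `document_term_matrix` below counts how often each term occurs in each document. The tokenizer is deliberately simplistic (lowercase letters and apostrophes only); in practice a library class such as scikit-learn's `CountVectorizer` would typically do this job.

```python
import re
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i.

    Illustrative helper only: tokens are runs of lowercase letters or apostrophes.
    """
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix
```

The resulting rows (one per book or chapter) are what a topic model would then factorize into topics.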
Loading data in Python
You can load text data from Project Gutenberg in Python using the gutenbergpy or requests libraries. Here’s a basic example using gutenbergpy:
- Install the gutenbergpy library:
!pip install gutenbergpy
- Load a book from Project Gutenberg: You’ll need the Gutenberg Book ID, which you can find by searching the Gutenberg website for the book you want. The Book ID is the number at the end of the book’s URL. For example, the URL https://www.gutenberg.org/ebooks/1342 corresponds to Pride and Prejudice, so its Book ID is 1342.
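from gutenbergpy.textget import get_text_by_id
from gutenbergpy.textget import strip_headers
# Replace 1342 with the ID of the book you want to download
book_id = 1342
# Download the raw book, including the Project Gutenberg header and footer
book_text = get_text_by_id(book_id)
# Remove the header and footer; gutenbergpy returns raw bytes (assumed here),
# so decode to a str before slicing
book_text_clean = strip_headers(book_text).strip().decode("utf-8")
# Print the first 500 characters
print(book_text_clean[:500])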
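The requests route mentioned at the start of this section can be sketched in a similar way. The helper names `book_id_from_url` and `fetch_plain_text` are hypothetical, and the plain-text URL pattern (`/ebooks/<id>.txt.utf-8`) is an assumption based on the "Plain Text UTF-8" link on Gutenberg book pages; verify it for the book you need before relying on it.

```python
import re


def book_id_from_url(url):
    """Extract the numeric Book ID from a Project Gutenberg ebook URL."""
    match = re.search(r"/ebooks/(\d+)", url)
    if match is None:
        raise ValueError(f"No Book ID found in URL: {url}")
    return int(match.group(1))


def fetch_plain_text(book_id, timeout=30):
    """Download a book's plain text with requests.

    The URL pattern below is an assumption, not a documented API.
    """
    import requests  # imported lazily so book_id_from_url works without requests installed

    url = f"https://www.gutenberg.org/ebooks/{book_id}.txt.utf-8"
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    response.encoding = "utf-8"
    return response.text
```

For example, `fetch_plain_text(book_id_from_url("https://www.gutenberg.org/ebooks/1342"))` would request Pride and Prejudice. Unlike the gutenbergpy example, text fetched this way still includes the Project Gutenberg license header and footer.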
Questions?
If you have any lingering questions about this resource, feel free to post them on the ML+X Nexus Q&A on GitHub. We will update this resource as new information or applications arise.
See also
- Workshop: Intro to Text Analysis / NLP: A hands-on introduction to natural language processing and how to extract insights from text data.