GeoDeepDive: Unlocking Knowledge from Scientific Literature
Overview
GeoDeepDive (GDD) is a cyberinfrastructure project designed to accelerate scientific discovery by extracting information from the vast and growing body of published scientific literature. While its roots are in geology, its applications span any domain that relies on published texts, including biology, materials science, medicine, and social sciences.
At its core, GDD is a massive database of over 15 million scientific documents (articles, theses, reports) that have been processed through a high-performance computing pipeline. This pipeline performs:
- Optical Character Recognition (OCR) to convert scanned PDFs into machine-readable text.
- Natural Language Processing (NLP) to parse sentences, identify parts of speech, and perform named entity recognition (e.g., finding mineral names, locations, species).
- Relation Extraction to find and catalog relationships between entities (e.g., “mineral X is found at location Y”).
The result is not just a collection of texts, but a structured, queryable knowledge graph. Researchers can use GDD’s public API to ask complex questions that would be impractical to answer through manual literature review, such as “find all papers that mention a specific fossil and its geological age” or “extract all measured values of a particular chemical compound.”
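To make that idea concrete, here is a purely illustrative sketch of what one extracted relation could look like as structured data. The field names and values below are invented for illustration and are not GDD’s actual schema:

```python
# Purely illustrative -- invented fields, not GDD's actual schema.
extracted_relation = {
    "doc_id": "example-doc-001",  # hypothetical document ID
    "sentence": "Stishovite was recovered from ejecta at Barringer Crater.",
    "entities": [
        {"text": "Stishovite", "type": "MINERAL"},
        {"text": "Barringer Crater", "type": "LOCATION"},
    ],
    "relation": ("Stishovite", "found_at", "Barringer Crater"),
}

# Structured records like this can be filtered and aggregated in bulk,
# which is what makes the corpus queryable rather than merely searchable.
minerals = [e["text"] for e in extracted_relation["entities"] if e["type"] == "MINERAL"]
print(minerals)  # ['Stishovite']
```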
Prerequisites
- Basic familiarity with Python and making HTTP requests.
- A GitHub account (to use GDD’s public API).
- An understanding of basic NLP concepts (Token, Sentence, Named Entity) is helpful but not strictly required to run the example.
Key Concepts and Definitions
| Concept | Definition |
|---|---|
| Document | Any processed text unit in the GDD database, typically a scientific publication. |
| NLP | Natural Language Processing, the field of AI concerned with interactions between computers and human language. |
| Named Entity Recognition (NER) | An NLP task that identifies and classifies key information (entities) in text into predefined categories such as persons, organizations, and locations. In GDD, these are often scientific terms. |
| API (Application Programming Interface) | A set of rules and tools that allows different software applications to communicate with each other. GDD provides an API to query its database programmatically. |
| JSON | JavaScript Object Notation, a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate. It is the primary format for data returned by the GDD API. |
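To ground the JSON concept before the tutorial, here is an illustrative sketch of the response shape the code below relies on. The fields (`success`, `total`, `data`, `_gddid`, `sentences`) are inferred from the tutorial code itself; the values are made up, and this is not a complete schema:

```python
# Illustrative response shape, inferred from the tutorial code below.
# The real API may return more (or differently named) fields.
example_response = {
    "success": {
        "total": 1,  # number of matching documents
        "data": [    # one entry per matching document
            {
                "_gddid": "doc-id-here",  # GDD's internal document ID (placeholder)
                "title": "An example article title",
                "journal": "Geology",
                "sentences": [
                    {"text": "Stishovite forms under very high pressure."},
                ],
            },
        ],
    },
}
print(example_response["success"]["total"])  # 1
```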
Tutorial: Querying the GeoDeepDive API for Mineral Mentions
This tutorial will guide you through a simple example of using Python to query the GeoDeepDive API to find sentences that mention the mineral “stishovite.”
Step 1: Get Your GitHub Token
The GDD API uses GitHub OAuth for authentication. You need to generate a personal access token.
- Go to your GitHub Settings.
- Navigate to Developer settings > Personal access tokens > Tokens (classic).
- Click Generate new token (classic). Give it a descriptive note (e.g., “GeoDeepDive API”).
- Select the `public_repo` scope. This is sufficient.
- Click Generate token and copy the token immediately (you won’t see it again!).
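As an optional sanity check before touching GDD, you can verify the token against GitHub’s own API, which accepts a personal access token via HTTP basic auth:

```python
import requests

# Optional: GitHub's /user endpoint returns your profile when the token is valid.
resp = requests.get(
    "https://api.github.com/user",
    auth=("YourGitHubUsername", "YOUR_GITHUB_TOKEN"),  # placeholders
    timeout=10,
)
print(resp.status_code)  # 200 means the token works; 401 means it does not
```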
Step 2: Set Up Your Python Environment
We’ll use the `requests` library to make HTTP calls. Let’s install it:

```
!pip install requests
```
Step 3: Configure Your Authentication
Now, let’s set up your authentication. Replace the placeholders with your actual GitHub credentials:
```python
import requests
import json

# Replace these with your actual GitHub credentials
GITHUB_USERNAME = "YourGitHubUsername"
GITHUB_TOKEN = "YOUR_GITHUB_TOKEN"  # Replace with the token you generated

print("Authentication configured successfully!")
```
Step 4: Query the GeoDeepDive API
Let’s search for documents mentioning the mineral “stishovite”:
```python
# The public endpoint for the GDD API
url = "https://geodeepdive.org/api/articles"

# The parameters for our query. We want sentences about 'stishovite'
params = {
    "term": "stishovite",  # The word or phrase to search for
    "full_results": True,  # Get full details, including sentences
    "sentences": True      # Include the sentences in the response
}

# Make the GET request to the API with authentication
response = requests.get(url, params=params, auth=(GITHUB_USERNAME, GITHUB_TOKEN))

# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    print(f"Found {data['success']['total']} documents mentioning 'stishovite'.\n")

    # Loop through the first few documents and print relevant sentences
    for i, doc in enumerate(data['success']['data'][:3]):  # Look at first 3 docs
        print(f"Document {i+1}: {doc['_gddid']}")
        print(f"  Title: {doc.get('title', 'No title available')}")
        print("  Sentences found:")
        stishovite_sentences = [s for s in doc['sentences'] if 'stishovite' in s['text'].lower()]
        for j, sentence in enumerate(stishovite_sentences[:2]):  # Show first 2 sentences per doc
            print(f"    {j+1}. {sentence['text']}")
        print(f"  Total sentences with 'stishovite': {len(stishovite_sentences)}")
        print("\n" + "─" * 80 + "\n")
else:
    print(f"Error: {response.status_code}")
    print(response.text)
```
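If you plan to run several queries, it can help to wrap the call in a small helper that adds a timeout and raises on HTTP errors instead of failing silently. This is just a convenience sketch built on the same endpoint and parameters used above:

```python
def gdd_search(term, **extra_params):
    """Query the GDD articles endpoint; return the parsed 'success' payload."""
    query = {"term": term, "full_results": True, "sentences": True}
    query.update(extra_params)
    resp = requests.get(
        "https://geodeepdive.org/api/articles",
        params=query,
        auth=(GITHUB_USERNAME, GITHUB_TOKEN),
        timeout=30,          # don't hang indefinitely on a slow response
    )
    resp.raise_for_status()  # raise on 4xx/5xx instead of continuing silently
    return resp.json()["success"]

# Same stishovite query as above, now one call:
result = gdd_search("stishovite")
print(result["total"])
```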
Step 5: Advanced Query - Filter by Journal
Let’s try a more specific query to find papers in specific journals:
```python
# Search for stishovite in specific journals
advanced_params = {
    "term": "stishovite",
    "journal": "science,nature,geology",  # Filter by journal names
    "full_results": True,
    "sentences": True,
    "limit": 5  # Limit to 5 results
}

advanced_response = requests.get(url, params=advanced_params, auth=(GITHUB_USERNAME, GITHUB_TOKEN))

if advanced_response.status_code == 200:
    advanced_data = advanced_response.json()
    print(f"🔍 Found {advanced_data['success']['total']} documents in specified journals.")

    if advanced_data['success']['total'] > 0:
        print("\n📊 Journal distribution:")
        journals = {}
        for doc in advanced_data['success']['data']:
            journal = doc.get('journal', 'Unknown')
            journals[journal] = journals.get(journal, 0) + 1

        for journal, count in journals.items():
            print(f"  {journal}: {count} documents")
    else:
        print("No documents found in the specified journals.")
else:
    print(f"Advanced query error: {advanced_response.status_code}")
```
Step 6: Export Results (Optional)
Let’s export the results to a JSON file for further analysis:
```python
import json
from datetime import datetime

# Export the results
if response.status_code == 200:
    export_data = {
        "query": "stishovite",
        "execution_date": datetime.now().isoformat(),
        "total_documents": data['success']['total'],
        "sample_documents": data['success']['data'][:5]  # First 5 documents
    }

    with open('geodeepdive_results.json', 'w') as f:
        json.dump(export_data, f, indent=2)

    print("💾 Results exported to 'geodeepdive_results.json'")

    # Show a preview
    print("\n📋 Preview of exported data:")
    print(f"Total documents: {export_data['total_documents']}")
    print(f"Sample size: {len(export_data['sample_documents'])}")
```
Summary
GeoDeepDive is a powerful tool for moving beyond simple keyword searches to true knowledge extraction. By providing programmatic access to a deeply processed corpus of scientific literature, it enables researchers to ask complex, data-driven questions at a scale that was previously impossible.
- Key Takeaway: GDD turns unstructured text into structured, queryable data.
- What we accomplished: We successfully queried the GeoDeepDive API, retrieved scientific documents mentioning “stishovite,” filtered results by journal, and exported the data for further analysis.
Additional Resources
- GeoDeepDive Official Website
- GitHub Guide for Personal Access Tokens
- Requests: HTTP for Humans (Python Library Docs)
Note: Remember to keep your GitHub token secure and never share it publicly. For production use, consider using environment variables or secure secret management.