Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of
Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation.
Matthew Collins (iDigBio)
Jorrit Poelen (independent)
Alexander Thompson (iDigBio)
Jennifer Hammock (EOL)

What We’re Interested In
Computation with biodiversity data
• Research at scale
• Lowering barriers to access
• Reproducibility
Matthew Collins, Technical Operations Manager - iDigBio
Jorrit Poelen, Independent
Alexander Thompson, Software Products Lead - iDigBio
Jennifer Hammock, Marine Theme Coordinator - EOL

Quick Review of Ways That We Work With Datasets
Focus here is on using large aggregated datasets to answer
research questions

Working With Datasets - Web Portals
Good: searching, visualizing location, browsing
Less good: data characterization, modeling, analysis, graphing

Working With Data - Purpose-Built Applications
Good: low barrier to entry, expert-built, documentation, peers
Less good: limited scope, limited ability to change

Working With Data - APIs & Libraries
Good: direct access to data, some simple analysis
Less good: programming barrier, performance limits

Working With Data - Download & Code
Good: ultimate flexibility, combine & merge
Less good: data management barrier, you’re the sysadmin

Working With Data - GUODA
Global Unified Open Data Access
(If SPNHC can be Spinach, GUODA can be Gouda)
An informal collaboration between technologists
from organizations like EOL, ePANDDA, and iDigBio as well as
independent biodiversity informaticists. We share data use
cases, best practices, infrastructure, code, and ideas around
the science that can be done by analyzing large open-access
biodiversity datasets.

Working With Data - GUODA Continued
Goals
• Have technologists discuss the technical challenges and
solution approaches in the biodiversity informatics domain
• Provide on-ramp for those who might not think of
themselves as “technologists”
• Fast parallel computation infrastructure and practices
(currently using Apache Spark)
• Local copies of entire datasets already formatted, ready for
computation at scale on provided infrastructure
• Hosting for services that rely on above

What Questions Does GUODA Make Approachable?
Can we create structured data from the unstructured text in
iDigBio records?
GUODA provides a platform to quickly start working on this
problem.
1. No data download
2. Jupyter Notebooks
3. Parallel processing of entire dataset

Data Characterization
Looking at the Darwin Core terms fieldNotes, occurrenceRemarks, and
eventRemarks to see how many characters are in which fields
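As a rough illustration of this characterization step, the sketch below counts characters per notes field over a few made-up records in plain Python. The records, the short field keys, and the helper names `field_lengths` and `summarize` are assumptions for illustration only; the analysis in the talk runs over the full iDigBio dataset in Spark.

```python
import math

# Hypothetical records keyed by short Darwin Core term names (illustrative only)
records = [
    {"fieldNotes": "", "occurrenceRemarks": "flight intercept trap", "eventRemarks": ""},
    {"fieldNotes": "tropical forest", "occurrenceRemarks": "", "eventRemarks": "fogging"},
    {"fieldNotes": "", "occurrenceRemarks": "", "eventRemarks": ""},
]

FIELDS = ("fieldNotes", "occurrenceRemarks", "eventRemarks")


def field_lengths(record):
    """Return the character count of each notes field in one record."""
    return {name: len(record.get(name) or "") for name in FIELDS}


def summarize(records):
    """Per field: number of non-empty values and their total character count."""
    summary = {name: {"non_empty": 0, "total_chars": 0} for name in FIELDS}
    for record in records:
        for name, length in field_lengths(record).items():
            if length > 0:
                summary[name]["non_empty"] += 1
                summary[name]["total_chars"] += length
    return summary


summary = summarize(records)

# log10 of every non-zero length; empty fields are skipped because log10(0) is undefined
log_lengths = [
    math.log10(n) for r in records for n in field_lengths(r).values() if n > 0
]
```

Skipping zero-length fields before taking the logarithm mirrors the `> 0` filters used when plotting per-field distributions.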

The Code to Produce That Figure
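import numpy
import seaborn as sns
from pyspark.sql import functions as sql  # the code below refers to this module as "sql"
# sqlContext is provided by the notebook's Spark environment
idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
idbdf.registerTempTable("idbtable")  # make the DataFrame queryable as "idbtable"
# Keep records with at least one non-empty notes field and join the three fields into one document
notes = sqlContext.sql("""
SELECT
`http://portal.idigbio.org/terms/uuid` as uuid,
`http://rs.tdwg.org/dwc/terms/fieldNotes` as fieldNotes,
`http://rs.tdwg.org/dwc/terms/eventRemarks` as eventRemarks,
`http://rs.tdwg.org/dwc/terms/occurrenceRemarks` as occurrenceRemarks,
TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
`http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
`http://rs.tdwg.org/dwc/terms/fieldNotes`)) as document
FROM idbtable WHERE
`http://rs.tdwg.org/dwc/terms/fieldNotes` != '' OR
`http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != '' OR
`http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
""")
# Character counts for the combined document and for each individual field
notes = notes.withColumn('document_len', sql.length(notes['document']))
notes = notes.withColumn('fieldNotes_len', sql.length(notes['fieldNotes']))
notes = notes.withColumn('eventRemarks_len', sql.length(notes['eventRemarks']))
notes = notes.withColumn('occurrenceRemarks_len', sql.length(notes['occurrenceRemarks']))
# sub_set is not defined on the slide; assumed here to be the four length columns
sub_set = ['document_len', 'fieldNotes_len', 'eventRemarks_len', 'occurrenceRemarks_len']
notes_pd = notes[sub_set].toPandas()
# Plot log10 of the lengths; zero-length fields are dropped before taking the log
sns.distplot(notes_pd['document_len'].dropna().apply(numpy.log10))
sns.distplot(notes_pd['fieldNotes_len'].dropna()[notes_pd['fieldNotes_len'] > 0].apply(numpy.log10))
sns.distplot(notes_pd['occurrenceRemarks_len'].dropna()[notes_pd['occurrenceRemarks_len'] > 0].apply(numpy.log10))
ax = sns.distplot(notes_pd['eventRemarks_len'].dropna()[notes_pd['eventRemarks_len'] > 0].apply(numpy.log10))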

The Interface to Write The Code
Notebooks
“Literate Programming”
Comments, code, and
outputs all together in a
readable document that
describes what is being
done

GUODA Notebook Architecture
A look at interacting with the GUODA data service through
Jupyter Notebooks

GUODA Data Service At Scale
Python NLTK parsing
and part-of-speech
tagging of notes fields
with noun-phrase
assembly.
Example phrases:
• Intercept trap
• Forest litters
• Field notes
• Field notebook
• Fogging fungus covered log
• Tropical forest
• Flight intercept trap
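To make "noun-phrase assembly" concrete, here is a minimal sketch that groups consecutive adjective and noun tokens into phrases. It is a simplified stand-in, not the trained chunker used in the talk: it assumes the tokens are already tagged with Penn Treebank part-of-speech tags, and the function name `assemble`, the tag sets, and the sample sentence are all illustrative assumptions.

```python
# Penn Treebank tags treated as noun-phrase material in this simplified sketch
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ"}


def assemble(tagged_tokens):
    """Group consecutive adjective/noun tokens into noun phrases.

    tagged_tokens is a list of (word, tag) pairs. Returns a list of
    {"tag": "NP", "phrase": ...} dicts, one per run that contains a noun.
    """
    phrases, run, has_noun = [], [], False
    # A sentinel token with a non-matching tag flushes the final run
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NOUN_TAGS or tag in MODIFIER_TAGS:
            run.append(word)
            has_noun = has_noun or tag in NOUN_TAGS
            continue
        if run and has_noun:
            phrases.append({"tag": "NP", "phrase": " ".join(run)})
        run, has_noun = [], False
    return phrases


# Hand-tagged example sentence (tags assigned by hand, not by a tagger)
tagged = [
    ("Collected", "VBN"), ("in", "IN"),
    ("flight", "NN"), ("intercept", "NN"), ("trap", "NN"),
    ("in", "IN"), ("tropical", "JJ"), ("forest", "NN"),
]
phrases = assemble(tagged)
```

Returning a list of string-to-string dictionaries keeps the output in the same general shape as an array of maps with "tag" and "phrase" keys, which is convenient when the function is later wrapped for parallel execution.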

The Code - 6 minutes for 3.2M Records
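from pyspark.sql import functions as sql, types
# t, p, and c are defined earlier in the notebook and are not shown on the slide
# (from their use here: a tokenizer, a part-of-speech tagger, and a trainable chunker)
c.train(c.load_training_data("../data/chunker_training_50_fixed.json"))
def pipeline(s):
    # tokenize, tag parts of speech, chunk, then assemble phrases
    return c.assemble(c.tag(p.tag(t.tokenize(s))))
# Wrap the pipeline as a Spark UDF that returns an array of string-to-string maps
pipeline_udf = sql.udf(pipeline, types.ArrayType(
    types.MapType(
        types.StringType(),
        types.StringType()
    )))
# Run the pipeline on every document, keep noun phrases (NP), and count each distinct phrase
phrases = (notes
    .withColumn("phrases", pipeline_udf(notes["document"]))
    .select(sql.explode(sql.col("phrases")).alias("text"))
    .filter(sql.col("text")["tag"] == "NP")
    .select(sql.lower(sql.col("text")["phrase"]).alias("phrase"))
    .groupBy(sql.col("phrase"))
    .count())
phrases.write.parquet('../data/idigbio_phrases.parquet')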
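For readers less familiar with Spark, the explode / filter / lower / groupBy / count chain above can be sketched on a single machine with plain Python. This is only an illustration of what the aggregation computes, not a replacement for the parallel version; the function name `count_noun_phrases` and the sample chunker output are made up for the example.

```python
from collections import Counter


def count_noun_phrases(documents_phrases):
    """Count lower-cased noun phrases across all documents.

    documents_phrases holds one entry per record, each a list of
    {"tag": ..., "phrase": ...} dicts as produced by a chunking pipeline.
    """
    counts = Counter()
    for phrases in documents_phrases:      # explode: one item per phrase
        for item in phrases:
            if item["tag"] == "NP":        # filter: keep noun phrases only
                counts[item["phrase"].lower()] += 1  # lower, group, count
    return counts


# Made-up chunker output for three records (illustrative only)
chunked = [
    [{"tag": "NP", "phrase": "Intercept trap"}, {"tag": "VP", "phrase": "collected"}],
    [{"tag": "NP", "phrase": "intercept trap"}, {"tag": "NP", "phrase": "Tropical forest"}],
    [],
]
phrase_counts = count_noun_phrases(chunked)
```

Lower-casing before counting merges variants such as "Intercept trap" and "intercept trap" into one entry, which is why the Spark version applies `lower` before `groupBy`.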

What Else is GUODA Besides Notebooks?
Remember “collaboration” and “infrastructure” to lower
barriers
• Twice monthly Google Hangouts
• Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL,
TraitBank so far
• Apache Spark cluster for computation
• Backs Effechecka http://effechecka.org/
• Backs Fresh Data https://github.com/gimmefreshdata/
• ePANDDA (we’re sharing ideas)
• iDigBio data quality workflows

Why is GUODA Important?
Perform research at a faster pace by “outsourcing” some of the
harder parts
Collect entire large datasets together in one place for
cross-dataset exploration without a data management barrier
Provides a foundation, both community and infrastructure,
upon which to build purpose-built applications and APIs
bigger and faster than before

How You Can Fit With GUODA
• Make your data available
• Data standards to make it relatable to other datasets
• Making data available doesn’t end with handoff to the
aggregator - where is your data used?
• Support workforce development
• Support next-wave things like ePANDDA
• Collaborate with GUODA when starting your own research

www.idigbio.org
facebook.com/iDigBio
twitter.com/iDigBio
vimeo.com/idigbio
idigbio.org/rss-feed.xml
webcal://www.idigbio.org/events-calendar/export.ics
Thank you!
http://guoda.bio
