iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of
Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation.
Mining Whole Museum Collections Datasets for
Expanding Understanding of Collections with
the GUODA Service
Matthew Collins (iDigBio)
Jorrit Poelen (independent)
Alexander Thompson (iDigBio)
Jennifer Hammock (EOL)
What We’re Interested In
Computation with biodiversity data
• Research at scale
• Lowering barriers to access
• Reproducibility
Matthew Collins
Technical Operations
Manager - iDigBio
Jorrit Poelen
Independent
Alexander Thompson
Software Products
Lead - iDigBio
Jennifer Hammock
Marine Theme
Coordinator - EOL
Quick Review of Ways That We Work With Datasets
Focus here is on using large aggregated datasets to answer
research questions
Working With Datasets - Web Portals
Good: searching, visualizing location, browsing
Less good: data characterization, modeling, analysis, graphing
Working With Data - Purpose-Built Applications
Good: low barrier to entry, expert-built, documentation, peers
Less good: limited scope, limited ability to change
Working With Data - APIs & Libraries
Good: direct access to data, some simple analysis
Less good: programming barrier, performance limits
Working With Data - Download & Code
Good: ultimate flexibility, combine & merge
Less good: data management barrier, you’re the sysadmin
Working With Data - GUODA
Global Unified Open Data Access
(If SPNHC can be “Spinach”, GUODA can be “Gouda”)
An informal collaboration between technologists
from organizations like EOL, ePANDDA, and iDigBio as well as
independent biodiversity informaticists. We share data use
cases, best practices, infrastructure, code, and ideas around
the science that can be done by analyzing large open-access
biodiversity datasets.
Working With Data - GUODA Continued
Goals
• Have technologists discuss the technical challenges and
solution approaches in the biodiversity informatics domain
• Provide on-ramp for those who might not think of
themselves as “technologists”
• Fast parallel computation infrastructure and practices
(currently using Apache Spark)
• Local copies of entire datasets already formatted, ready for
computation at scale on provided infrastructure
• Hosting for services that rely on above
What Questions Does GUODA Make Approachable?
Can we create structured data from the unstructured text in
iDigBio records?
GUODA provides a platform to quickly start working on this
problem.
1. No data download
2. Jupyter Notebooks
3. Parallel processing of entire dataset
Data Characterization
Looking at the Darwin Core terms fieldNotes, occurrenceRemarks,
and eventRemarks to see how many characters are in which fields
The Code to Produce That Figure
import numpy
import seaborn as sns
import pyspark.sql.functions as sql

# sqlContext is assumed to be provided by the notebook environment
idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
idbdf.registerTempTable("idbtable")  # make the DataFrame queryable as idbtable

# Keep the three notes fields under short names and build one combined document
notes = sqlContext.sql("""
SELECT
`http://portal.idigbio.org/terms/uuid` as uuid,
`http://rs.tdwg.org/dwc/terms/fieldNotes` as fieldNotes,
`http://rs.tdwg.org/dwc/terms/occurrenceRemarks` as occurrenceRemarks,
`http://rs.tdwg.org/dwc/terms/eventRemarks` as eventRemarks,
TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
`http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
`http://rs.tdwg.org/dwc/terms/fieldNotes`)) as document
FROM idbtable WHERE
`http://rs.tdwg.org/dwc/terms/fieldNotes` != '' OR
`http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != '' OR
`http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
""")

# Character counts for the combined document and for each field
notes = notes.withColumn('document_len', sql.length(notes['document']))
notes = notes.withColumn('fieldNotes_len', sql.length(notes['fieldNotes']))
notes = notes.withColumn('eventRemarks_len', sql.length(notes['eventRemarks']))
notes = notes.withColumn('occurrenceRemarks_len', sql.length(notes['occurrenceRemarks']))

# Bring only the length columns back to the notebook for plotting
sub_set = ['document_len', 'fieldNotes_len', 'eventRemarks_len', 'occurrenceRemarks_len']
notes_pd = notes.select(sub_set).toPandas()

# Overlay the log10 length distributions; empty fields are skipped
sns.distplot(notes_pd['document_len'].dropna().apply(numpy.log10))
for col in ['fieldNotes_len', 'occurrenceRemarks_len', 'eventRemarks_len']:
    lens = notes_pd[col].dropna()
    ax = sns.distplot(lens[lens > 0].apply(numpy.log10))
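The same characterization can be tried on a handful of records in plain Python, without a Spark cluster. This is only a sketch: the sample records and the helper name `characterize` are made up for illustration and are not iDigBio data or part of the GUODA notebooks.

```python
import math

# Hypothetical sample records; real iDigBio records carry many more fields.
records = [
    {"uuid": "a", "fieldNotes": "", "occurrenceRemarks": "flight intercept trap", "eventRemarks": ""},
    {"uuid": "b", "fieldNotes": "forest litter", "occurrenceRemarks": "", "eventRemarks": "night"},
    {"uuid": "c", "fieldNotes": "", "occurrenceRemarks": "", "eventRemarks": ""},
]

# Same field order as the CONCAT in the SQL query
FIELDS = ("occurrenceRemarks", "eventRemarks", "fieldNotes")


def characterize(record):
    """Return per-field character counts plus the combined document length,
    or None when all three notes fields are empty (the WHERE clause)."""
    if not any(record[f] for f in FIELDS):
        return None
    document = " ".join(record[f] for f in FIELDS).strip()
    lengths = {f + "_len": len(record[f]) for f in FIELDS}
    lengths["document_len"] = len(document)
    return lengths


rows = [r for r in map(characterize, records) if r is not None]

# log10 of the document lengths, the quantity the histograms are drawn from
log_doc_lens = [math.log10(r["document_len"]) for r in rows]
```

Record "c" is dropped because all of its notes fields are empty, mirroring the filter in the query.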
The Interface to Write The Code
Notebooks
“Literate Programming”
Comments, code, and outputs all together in a readable document
that describes what is being done
GUODA Notebook Architecture
A look at interacting with the GUODA data service through
Jupyter Notebooks
GUODA Data Service At Scale
Python NLTK parsing and part-of-speech tagging of notes fields
with noun-phrase assembly.
Example phrases:
• Intercept trap
• Forest litters
• Field notes
• Field notebook
• Fogging fungus covered log
• Tropical forest
• Flight intercept trap
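Noun-phrase assembly can be sketched without any libraries once tokens carry part-of-speech tags. The sketch below assumes Penn Treebank-style tags and a deliberately simple rule (a run of adjectives and nouns that ends in a noun counts as one phrase). It is not the trained chunker used in the talk, and `assemble_noun_phrases` is a hypothetical name.

```python
def assemble_noun_phrases(tagged_tokens):
    """Group runs of adjective/noun tokens into noun phrases.

    tagged_tokens is a list of (token, tag) pairs with Penn Treebank-style
    tags. A run is kept only if it ends in a noun.
    """
    phrases, run = [], []
    # The sentinel pair at the end flushes a run that reaches the last token
    for token, tag in tagged_tokens + [("", "END")]:
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((token, tag))
            continue
        # Drop trailing adjectives so the phrase ends on a noun
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append(" ".join(tok for tok, _ in run))
        run = []
    return phrases


# Hand-tagged example sentence (tags assigned by hand for illustration)
tagged = [("Collected", "VBN"), ("in", "IN"), ("flight", "NN"),
          ("intercept", "NN"), ("trap", "NN"), ("in", "IN"),
          ("tropical", "JJ"), ("forest", "NN"), (".", ".")]
print(assemble_noun_phrases(tagged))
```

A trained chunker replaces the fixed rule with patterns learned from annotated examples, but the input and output shapes stay the same: tagged tokens in, phrases out.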
The Code - 6 minutes for 3.2M Records
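import pyspark.sql.functions as sql
from pyspark.sql import types

# t, p, and c are defined earlier in the notebook (not shown on this slide);
# judging by the calls, a tokenizer, a part-of-speech tagger, and a chunker
c.train(c.load_training_data("../data/chunker_training_50_fixed.json"))

def pipeline(s):
    # tokenize -> part-of-speech tag -> chunk tag -> assemble phrases
    return c.assemble(c.tag(p.tag(t.tokenize(s))))

# Each document yields an array of maps with "tag" and "phrase" keys
pipeline_udf = sql.udf(pipeline, types.ArrayType(
    types.MapType(
        types.StringType(),
        types.StringType()
    )))

# Parentheses let the method chain span several lines
phrases = (notes
    .withColumn("phrases", pipeline_udf(notes["document"]))
    .select(sql.explode(sql.col("phrases")).alias("text"))
    .filter(sql.col("text")["tag"] == "NP")
    .select(sql.lower(sql.col("text")["phrase"]).alias("phrase"))
    .groupBy(sql.col("phrase"))
    .count())

phrases.write.parquet('../data/idigbio_phrases.parquet')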
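The explode, filter, lower, groupBy, and count chain is a word-count style aggregation. A plain Python equivalent on a few hand-made rows gives a feel for what each step does. The rows here are hypothetical data, shaped like the pipeline output (maps with `tag` and `phrase` keys).

```python
from collections import Counter

# Hypothetical pipeline output: one list of {"tag", "phrase"} maps per record
rows = [
    [{"tag": "NP", "phrase": "Intercept trap"},
     {"tag": "VP", "phrase": "collected"}],
    [{"tag": "NP", "phrase": "intercept trap"},
     {"tag": "NP", "phrase": "Tropical forest"}],
]

counts = Counter(
    chunk["phrase"].lower()   # lower-case, like sql.lower
    for chunks in rows        # explode: one item per chunk
    for chunk in chunks
    if chunk["tag"] == "NP"   # keep noun phrases only
)

# Most frequent phrases first
for phrase, n in counts.most_common():
    print(phrase, n)
```

Spark performs the same grouping in parallel across the cluster, which is what makes a full pass over millions of records practical.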
What Else is GUODA Besides Notebooks?
Remember “collaboration” and “infrastructure” to lower
barriers
• Twice monthly Google Hangouts
• Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL,
TraitBank so far
• Apache Spark cluster for computation
• Backs Effechecka http://effechecka.org/
• Backs Fresh Data https://github.com/gimmefreshdata/
• ePANDDA (we’re sharing ideas)
• iDigBio data quality workflows
Why is GUODA Important?
Perform research at a faster pace by “outsourcing” some of the
harder parts
Collect entire large datasets together in one place for cross-
dataset exploration without data management barrier
Provides a foundation, both community and infrastructure,
upon which to build purpose-built applications and APIs
bigger and faster than before
How You Can Fit With GUODA
• Make your data available
• Data standards to make it relatable to other datasets
• Making data available doesn’t end with handoff to the
aggregator - where is your data used?
• Support workforce development
• Support next-wave things like ePANDDA
• Collaborate with GUODA when starting your own research
www.idigbio.org
facebook.com/iDigBio
twitter.com/iDigBio
vimeo.com/idigbio
idigbio.org/rss-feed.xml
webcal://www.idigbio.org/events-calendar/export.ics
Thank you!
http://guoda.bio

Weitere ähnliche Inhalte

Andere mochten auch

I/O: Intelligent Outsourcing 2016 | Jennifer Kumar
I/O: Intelligent Outsourcing 2016 | Jennifer KumarI/O: Intelligent Outsourcing 2016 | Jennifer Kumar
I/O: Intelligent Outsourcing 2016 | Jennifer KumarBEAM - Bridge Events & Meets
 
Aplicaciones web 2
Aplicaciones web 2Aplicaciones web 2
Aplicaciones web 2roxana1995
 
Herramientas web 2 jfjhdjdkghjkdf
Herramientas web 2 jfjhdjdkghjkdfHerramientas web 2 jfjhdjdkghjkdf
Herramientas web 2 jfjhdjdkghjkdfnayarbarom
 
Workshop Mentes hiperactivas
Workshop Mentes hiperactivasWorkshop Mentes hiperactivas
Workshop Mentes hiperactivasJorge Lima
 
Carnaval 2012 no CEGV
Carnaval 2012 no CEGVCarnaval 2012 no CEGV
Carnaval 2012 no CEGVJorge Lima
 
Новые возможности с MONAVIE
Новые возможности с MONAVIEНовые возможности с MONAVIE
Новые возможности с MONAVIENatalya Shulga
 
Edital nº 008/2015 Intinerario de veiculo para transporte de eleitores
Edital nº 008/2015 Intinerario de veiculo para transporte de eleitoresEdital nº 008/2015 Intinerario de veiculo para transporte de eleitores
Edital nº 008/2015 Intinerario de veiculo para transporte de eleitoresRaquel Freitas
 
MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...
MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...
MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...Giorgio Federico Garbetta
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppDynamics
 
Getting the new year started with Microsoft Power BI!
Getting the new year started with Microsoft Power BI!Getting the new year started with Microsoft Power BI!
Getting the new year started with Microsoft Power BI!Dan English
 
Social Media and Reputation Management for Small Firms
Social Media and Reputation Management for Small FirmsSocial Media and Reputation Management for Small Firms
Social Media and Reputation Management for Small FirmsInternet Law Center
 
Kra & kpa by nitish rathi
Kra & kpa by nitish rathiKra & kpa by nitish rathi
Kra & kpa by nitish rathiNitish Rathi
 
Le merchandising au sein de la grande distribution
Le merchandising au sein de la grande distributionLe merchandising au sein de la grande distribution
Le merchandising au sein de la grande distributionGuillaume Bourgogne
 
Descripcion mascota
Descripcion mascotaDescripcion mascota
Descripcion mascotagracielasudi
 

Andere mochten auch (18)

Mohammad SUPERVISOR
Mohammad SUPERVISORMohammad SUPERVISOR
Mohammad SUPERVISOR
 
I/O: Intelligent Outsourcing 2016 | Jennifer Kumar
I/O: Intelligent Outsourcing 2016 | Jennifer KumarI/O: Intelligent Outsourcing 2016 | Jennifer Kumar
I/O: Intelligent Outsourcing 2016 | Jennifer Kumar
 
Aplicaciones web 2
Aplicaciones web 2Aplicaciones web 2
Aplicaciones web 2
 
Herramientas web 2 jfjhdjdkghjkdf
Herramientas web 2 jfjhdjdkghjkdfHerramientas web 2 jfjhdjdkghjkdf
Herramientas web 2 jfjhdjdkghjkdf
 
Workshop Mentes hiperactivas
Workshop Mentes hiperactivasWorkshop Mentes hiperactivas
Workshop Mentes hiperactivas
 
Carnaval 2012 no CEGV
Carnaval 2012 no CEGVCarnaval 2012 no CEGV
Carnaval 2012 no CEGV
 
Новые возможности с MONAVIE
Новые возможности с MONAVIEНовые возможности с MONAVIE
Новые возможности с MONAVIE
 
Edital nº 008/2015 Intinerario de veiculo para transporte de eleitores
Edital nº 008/2015 Intinerario de veiculo para transporte de eleitoresEdital nº 008/2015 Intinerario de veiculo para transporte de eleitores
Edital nº 008/2015 Intinerario de veiculo para transporte de eleitores
 
MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...
MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...
MA DISSERTATION - Study of the Relationships Between Farmers and the Ghanaian...
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance Challenges
 
Getting the new year started with Microsoft Power BI!
Getting the new year started with Microsoft Power BI!Getting the new year started with Microsoft Power BI!
Getting the new year started with Microsoft Power BI!
 
Social Media and Reputation Management for Small Firms
Social Media and Reputation Management for Small FirmsSocial Media and Reputation Management for Small Firms
Social Media and Reputation Management for Small Firms
 
Kra & kpa by nitish rathi
Kra & kpa by nitish rathiKra & kpa by nitish rathi
Kra & kpa by nitish rathi
 
QA automation
QA automationQA automation
QA automation
 
Laurie_Skipper_Resume_2017 Business
Laurie_Skipper_Resume_2017 BusinessLaurie_Skipper_Resume_2017 Business
Laurie_Skipper_Resume_2017 Business
 
Le merchandising au sein de la grande distribution
Le merchandising au sein de la grande distributionLe merchandising au sein de la grande distribution
Le merchandising au sein de la grande distribution
 
Descripcion mascota
Descripcion mascotaDescripcion mascota
Descripcion mascota
 
Disaster
DisasterDisaster
Disaster
 

Ähnlich wie Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabeDataiku
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Hao Chen
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...Matthew J Collins
 
Apache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoTApache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoTjixuan1989
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta PyData
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadatasuyu22
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseMongoDB
 
Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerBig Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerMongoDB
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Stefan Urbanek
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009Ian Foster
 

Ähnlich wie Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service (20)

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
 
Apache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoTApache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoT
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadata
 
Green dao
Green daoGreen dao
Green dao
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick Database
 
Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerBig Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision Maker
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 

Kürzlich hochgeladen

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 

Kürzlich hochgeladen (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE

Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

  • 1. iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service. Matthew Collins (iDigBio), Jorrit Poelen (independent), Alexander Thompson (iDigBio), Jennifer Hammock (EOL)
  • 2. What We’re Interested In: Computation with biodiversity data • Research at scale • Lowering barriers to access • Reproducibility. Matthew Collins, Technical Operations Manager - iDigBio; Jorrit Poelen, Independent; Alexander Thompson, Software Products Lead - iDigBio; Jennifer Hammock, Marine Theme Coordinator - EOL
  • 3. Quick Review of Ways That We Work With Datasets. Focus here is on using large aggregated datasets to answer research questions.
  • 4. Working With Datasets - Web Portals. Good: searching, visualizing location, browsing. Less good: data characterization, modeling, analysis, graphing.
  • 5. Working With Data - Purpose-Built Applications. Good: low barrier to entry, expert-built, documentation, peers. Less good: limited scope, limited ability to change.
  • 6. Working With Data - APIs & Libraries. Good: direct access to data, some simple analysis. Less good: programming barrier, performance limits.
  • 7. Working With Data - Download & Code. Good: ultimate flexibility, combine & merge. Less good: data management barrier, you’re the sysadmin.
  • 8. Working With Data - GUODA. Global Unified Open Data Access (If SPNHC can be Spinach, GUODA Gouda). An informal collaboration between technologists from organizations like EOL, ePANDDA, and iDigBio as well as independent biodiversity informaticists. We share data use cases, best practices, infrastructure, code, and ideas around the science that can be done by analyzing large open-access biodiversity datasets.
  • 9. Working With Data - GUODA Continued. Goals: • Have technologists discuss the technical challenges and solution approaches in the biodiversity informatics domain • Provide on-ramp for those who might not think of themselves as “technologists” • Fast parallel computation infrastructure and practices (currently using Apache Spark) • Local copies of entire datasets already formatted, ready for computation at scale on provided infrastructure • Hosting for services that rely on above
  • 10. What Questions Does GUODA Make Approachable? Can we create structured data from the unstructured text in iDigBio records? GUODA provides a platform to quickly start working on this problem: 1. No data download; 2. Jupyter Notebooks; 3. Parallel processing of the entire dataset.
  • 11. Data Characterization. Looking at the Darwin Core terms fieldNotes, occurrenceRemarks, and eventRemarks to see how many characters are in which fields.
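  • 12. The Code to Produce That Figure
        import numpy
        import seaborn as sns
        from pyspark.sql import functions as sql
        # Load the iDigBio occurrence records. The slide queries "idbtable" without showing
        # the registration step; registerTempTable is the Spark 1.x call for it.
        idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
        idbdf.registerTempTable("idbtable")
        # Keep records with at least one non-empty notes field and build one combined document.
        # The three source fields are also selected under short names so the length
        # columns below can refer to them.
        notes = sqlContext.sql("""
            SELECT `http://portal.idigbio.org/terms/uuid` AS uuid,
                   `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` AS occurrenceRemarks,
                   `http://rs.tdwg.org/dwc/terms/eventRemarks` AS eventRemarks,
                   `http://rs.tdwg.org/dwc/terms/fieldNotes` AS fieldNotes,
                   TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
                               `http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
                               `http://rs.tdwg.org/dwc/terms/fieldNotes`)) AS document
            FROM idbtable
            WHERE `http://rs.tdwg.org/dwc/terms/fieldNotes` != ''
               OR `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != ''
               OR `http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
        """)
        # Character counts for the combined document and for each field.
        notes = notes.withColumn('document_len', sql.length(notes['document']))
        notes = notes.withColumn('fieldNotes_len', sql.length(notes['fieldNotes']))
        notes = notes.withColumn('eventRemarks_len', sql.length(notes['eventRemarks']))
        notes = notes.withColumn('occurrenceRemarks_len', sql.length(notes['occurrenceRemarks']))
        # sub_set is not defined on the slide; presumably it lists the *_len columns to collect.
        notes_pd = notes[ sub_set ].toPandas()
        # Plot the distribution of log10(length), skipping empty fields.
        sns.distplot(notes_pd['document_len'].dropna().apply(numpy.log10))
        sns.distplot(notes_pd['fieldNotes_len'].dropna()[ notes_pd['fieldNotes_len']>0 ].apply(numpy.log10))
        sns.distplot(notes_pd['occurrenceRemarks_len'].dropna()[ notes_pd['occurrenceRemarks_len']>0 ].apply(numpy.log10))
        ax = sns.distplot(notes_pd['eventRemarks_len'].dropna()[ notes_pd['eventRemarks_len']>0 ].apply(numpy.log10))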
  • 13. The Interface to Write The Code: Notebooks. “Literate Programming”: comments, code, and outputs all together in a readable document that describes what is being done.
  • 14. GUODA Notebook Architecture. A look at interacting with the GUODA data service through Jupyter Notebooks.
  • 15. GUODA Data Service At Scale. Python NLTK parsing and part-of-speech tagging of notes fields with noun-phrase assembly. Example phrases: • Intercept trap • Forest litters • Field notes • Field notebook • Fogging fungus covered log • Tropical forest • Flight intercept trap
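  • 16. The Code - 6 minutes for 3.2M Records
        from pyspark.sql import functions as sql, types
        # t, p, and c are not constructed on the slide; from the method names they are
        # presumably the tokenizer, the part-of-speech tagger, and the noun-phrase chunker.
        c.train(c.load_training_data("../data/chunker_training_50_fixed.json"))
        def pipeline(s):
            # tokenize -> POS-tag -> chunk-tag -> assemble phrases
            return c.assemble(c.tag(p.tag(t.tokenize(s))))
        # The UDF returns a list of string-to-string maps, one map per phrase.
        pipeline_udf = sql.udf(pipeline, types.ArrayType(
            types.MapType(types.StringType(), types.StringType())))
        # Parenthesized so the method chain can span several lines.
        phrases = (notes
            .withColumn("phrases", pipeline_udf(notes["document"]))
            .select(sql.explode(sql.col("phrases")).alias("text"))
            .filter(sql.col("text")["tag"] == "NP")
            .select(sql.lower(sql.col("text")["phrase"]).alias("phrase"))
            .groupBy(sql.col("phrase"))
            .count())
        phrases.write.parquet('../data/idigbio_phrases.parquet')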
  • 17. What Else is GUODA Besides Notebooks? Remember “collaboration” and “infrastructure” to lower barriers: • Twice-monthly Google Hangouts • Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL, TraitBank so far • Apache Spark cluster for computation • Backs Effechecka http://effechecka.org/ • Backs Fresh Data https://github.com/gimmefreshdata/ • ePANDDA (we’re sharing ideas) • iDigBio data quality workflows
  • 18. Why is GUODA Important? Perform research at a faster pace by “outsourcing” some of the harder parts. Collect entire large datasets together in one place for cross-dataset exploration without the data management barrier. Provides a foundation, both community and infrastructure, upon which to build purpose-built applications and APIs bigger and faster than before.
  • 19. How You Can Fit With GUODA: • Make your data available • Data standards to make it relatable to other datasets • Making data available doesn’t end with handoff to the aggregator - where is your data used? • Support workforce development • Support next-wave things like ePANDDA • Collaborate with GUODA when starting your own research
  • 20. iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. www.idigbio.org facebook.com/iDigBio twitter.com/iDigBio vimeo.com/idigbio idigbio.org/rss-feed.xml webcal://www.idigbio.org/events-calendar/export.ics Thank you! http://guoda.bio
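
As a rough illustration of the noun-phrase assembly and counting described in slides 15 and 16, the sketch below does the same two steps in plain Python, with no Spark or NLTK. It is not the presenters' chunker, which is trained from a JSON file and is not reproduced here. The tag sets, the function names `assemble_noun_phrases` and `count_phrases`, and the hand-tagged example documents are all assumptions made for illustration.

```python
from collections import Counter

# Penn Treebank tags allowed inside a phrase in this sketch (an assumption,
# not the presenters' trained chunker).
NP_TAGS = {"JJ", "NN", "NNS", "NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def assemble_noun_phrases(tagged_tokens):
    """Group consecutive adjective/noun tokens; keep runs that contain a noun."""
    phrases, current = [], []
    # A sentinel token with a non-NP tag flushes the final run.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if any(t in NOUN_TAGS for _, t in current):
                phrases.append(" ".join(w for w, _ in current).lower())
            current = []
    return phrases


def count_phrases(documents):
    """Count lower-cased phrases across documents (the groupBy/count step)."""
    counts = Counter()
    for tagged in documents:
        counts.update(assemble_noun_phrases(tagged))
    return counts


# Hypothetical, hand-tagged documents standing in for tagger output.
docs = [
    [("Collected", "VBD"), ("in", "IN"), ("flight", "NN"),
     ("intercept", "NN"), ("trap", "NN")],
    [("Tropical", "JJ"), ("forest", "NN"), ("near", "IN"),
     ("flight", "NN"), ("intercept", "NN"), ("trap", "NN")],
]
phrase_counts = count_phrases(docs)
for phrase, n in phrase_counts.most_common():
    print(n, phrase)
```

In the Spark version on slide 16, the per-document step runs inside a UDF and the counting is done by `groupBy` and `count`, so the work is spread across the cluster rather than looped over in one process.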