BHL in the Cloud: A Pilot Project
Tom Garnett, BHL Executive Director
Chris Freeland, BHL Technical Director
http://www.biodiversitylibrary.org

The Biodiversity Heritage Library (BHL) is a global community of natural history libraries and research institutions that have formed a partnership to digitize and make available the world's biodiversity literature.
Now online:
- 82,000 volumes
- 31 million pages

BHL by the Book
- One volume (380 pages on average) = multiple files of varying sizes, with relationships among them
- File types: PDF, OCR, JP2, XML
- More than 70 TB, growing every day…

Current distributed infrastructure
- MBL: redundant cluster
- MOBOT: database & web application
- Internet Archive: digitized content / files
- Diagram labels: Content, Metadata

BHL Vision: Global Infrastructure
- Access System: files, metadata & services needed to deliver content.
- Preservation System: multiple redundant copies of all digitized content.
- Diagram labels: Data Ingest, Replicate, Sync

Motivation for joining pilot
- Community interest in cloud storage (funding organizations, too!)
- Wanted to evaluate the applicability of cloud storage to large-scale digitization activities
- Solutions for efficient transfer of tens to hundreds of TB of data
- Lower-cost alternatives to maintaining large data centers

BHL as a research space
- BHL nodes are autonomous centers serving the digitized texts through their applications in response to users.
- But the BHL corpus as a whole is a biodiversity data set in its own right. Embedded in it are:
  - Predator/prey relationships
  - Habitat/distribution data
  - Host/parasite data
  - Pathogen/disease vector data
- Third-party researchers and projects are interested in mining the BHL texts for multiple research needs.
- Using one site both for serving/accessing/downloading digital texts and for data mining is messy.
- Separate the two and put a version of the corpus in a public-like cloud space.

BHL Policy Challenges
- Money: At present in the US, one BHL member library (MBL) is willing to provide essentially free redundant hosting. This is a very attractive financial offer. Since the MBL is a BHL member, it also provides a level of administrative commitment. If this changes, DuraCloud becomes very attractive.
- Skill level: Multiple global partners, needing all or some of the current holdings, have varying levels of technical skill. For some, shipping hard drives might be easier. For others, uploading to and downloading from DuraCloud might be preferable.
- Timing: At the close of the pilot our partners, while very close, are not quite ready for the initial large data transfer. As they get their marbles lined up, we can evaluate DuraCloud as a transfer mechanism on a node-by-node basis.
- Control: In cultural-scientific digital projects there are no clear models using DuraCloud. Early-adopter paranoia.

Data Transfer Methods & Limitations
- Node A to Node B by shipped hard drives. Problems: hardware failure, data loss, shipping fees
vs.
- Node A to Node B over the network. Problems: available bandwidth, upload/download fees
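
To make the bandwidth limitation concrete, network transfer time can be estimated from corpus size and sustained throughput. The sketch below is illustrative arithmetic only: the 70 TB figure is the corpus size stated earlier in the deck, while the link speeds, the 70% efficiency factor, and the function name `transfer_days` are assumptions, not measurements from the pilot.

```python
def transfer_days(terabytes, megabits_per_second, efficiency=0.7):
    """Estimate the days needed to move `terabytes` over a network link.

    `efficiency` is an assumed fraction of nominal bandwidth that is
    actually sustained. Decimal units: 1 TB = 10**12 bytes, 1 Mbit = 10**6 bits.
    """
    if terabytes < 0 or megabits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, rate > 0, efficiency in (0, 1]")
    bits = terabytes * 10**12 * 8
    seconds = bits / (megabits_per_second * 10**6 * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for mbps in (100, 1000):  # hypothetical link speeds
        print(f"70 TB at {mbps} Mbit/s: {transfer_days(70, mbps):.1f} days")
```

Under these assumptions, 70 TB takes roughly 93 days at 100 Mbit/s and about 9 days at 1 Gbit/s, before any time spent verifying the data.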

Data transfer: Cloud vs. Cluster
- Inventory & audit lists
- Checksums for data integrity
- Heavy lifting at BHL scale, regardless of endpoint: weeks to months, not minutes to days
Differences
- In a cluster environment, you have to be intimately involved in hardware decisions, maintenance, and troubleshooting
- In a cloud environment, those worries are part of your fee
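
The inventory and checksum steps named on this slide can be sketched in Python. This is a minimal illustration, not BHL's or DuraCloud's actual tooling: the function names `file_checksum`, `build_manifest`, and `audit`, the choice of SHA-256, and the 1 MiB chunk size are all assumptions. Files are read in chunks so that multi-gigabyte files never have to fit in memory.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # assumed read size (1 MiB)


def file_checksum(path, algorithm="sha256"):
    """Hash a file in fixed-size chunks so large files are never fully loaded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Inventory every file under root: relative path -> (size in bytes, checksum)."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = (path.stat().st_size, file_checksum(path, algorithm))
    return manifest


def audit(source_manifest, destination_manifest):
    """Report files missing at the destination and files whose size or checksum differs."""
    missing = sorted(set(source_manifest) - set(destination_manifest))
    mismatched = sorted(
        name
        for name in source_manifest
        if name in destination_manifest
        and source_manifest[name] != destination_manifest[name]
    )
    return {"missing": missing, "mismatched": mismatched}
```

A manifest built at the source and another built at the destination can be passed to `audit` after a transfer; an empty result for both keys means every inventoried file arrived with a matching size and checksum.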

Challenges for adopting cloud storage
- BHL is embedded in longstanding institutions with megainfrastructure
  - Already support data storage & maintenance at BHL scale
  - Little funding for alternative infrastructure / storage
  - Current storage is (really, truly) free through Internet Archive
- Costs associated with download / use of content
  - BHL is a global resource for a broad community
  - User community wants to "do things" with data

Lessons Learned
- Cloud infrastructure & its applicability to BHL are no longer a mystery
- Nothing is free
  - Except when it is
- Cloud storage provides the ability to quickly scale infrastructure
  - No lost time procuring & configuring hardware
- Useful for the right kinds of datasets
  - It's not the size of the corpus, it's the size of the files
  - Huge files are problematic

Outcomes
- 10-13 TB transferred from Internet Archive to DuraCloud over the wire
  - Simple, without bandwidth limitations
- Became intimately familiar with our data
  - Larger files in corpus than expected (GB+ files)
  - Issues with "checksums"
- Need to know your data to manage it efficiently
  - Spent less time moving data than checking / verifying data
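
One way to get to know the data before a transfer is to profile file sizes from an inventory. The following is a small sketch under stated assumptions: the inventory is taken to be a list of (name, size in bytes) pairs, the 1 GB threshold mirrors the "GB+ files" remark, and the function name `profile_sizes` is hypothetical.

```python
from statistics import median

GIGABYTE = 10**9  # decimal gigabyte; threshold chosen to match "GB+ files"


def profile_sizes(inventory, threshold=GIGABYTE):
    """Summarize an inventory of (name, size_in_bytes) pairs.

    Returns the file count, total and median size, and the files at or
    above the threshold, largest first.
    """
    sizes = [size for _, size in inventory]
    large = sorted(
        ((name, size) for name, size in inventory if size >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    return {
        "count": len(sizes),
        "total_bytes": sum(sizes),
        "median_bytes": median(sizes) if sizes else 0,
        "large_files": large,
    }
```

Listing the largest files first makes it easier to decide which ones need special handling, such as separate scheduling or extra verification, before the bulk transfer starts.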

Perceptions about cloud infrastructure after pilot participation
- More possibilities than expected:
  - Features
  - Movement
  - Support available from commercial providers
  - Increasing menus of choices
- There is no silver bullet
  - Cloud is just a different endpoint for file storage
  - It doesn't solve all problems related to repository management

Future opportunities for cloud infrastructure & BHL
- Depending on BHL partner needs/abilities, use DuraCloud to transfer/sync files
- Seek research grants for data mining and include line items for DuraCloud hosting of a BHL "research space" for multiple informatics projects
- If "free" turns into "not so free", use DuraCloud as ongoing redundant preservation storage
- As we explore synchronization across projects, is DuraCloud an alternative?
BHL in the cloud A Pilot Project Tom Garnett, BHL Executive Director Chris Freeland, BHL Technical Director http://www.biodiversitylibrary.org

Weitere ähnliche Inhalte

Was ist angesagt?

INN530 - Assignment 2, Big data and cloud computing for management
INN530 - Assignment 2, Big data and cloud computing for managementINN530 - Assignment 2, Big data and cloud computing for management
INN530 - Assignment 2, Big data and cloud computing for managementSimen Smaaberg
 
Libraries in the_cloud_mnlib11
Libraries in the_cloud_mnlib11Libraries in the_cloud_mnlib11
Libraries in the_cloud_mnlib11donovan_lambright
 
CLOUD COMPUTING AND ITS APPLICATIONS IN DIGITAL LIBRARY SERVICES
CLOUD COMPUTING AND ITS APPLICATIONS IN  DIGITAL LIBRARY SERVICESCLOUD COMPUTING AND ITS APPLICATIONS IN  DIGITAL LIBRARY SERVICES
CLOUD COMPUTING AND ITS APPLICATIONS IN DIGITAL LIBRARY SERVICESKoushik Pathak
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
CLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationCLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationEnno Meijers
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraAlluxio, Inc.
 
Nabil Sultan. The disruptive and democratizing credentials of cloud computing
Nabil Sultan. The disruptive and democratizing credentials of  cloud computingNabil Sultan. The disruptive and democratizing credentials of  cloud computing
Nabil Sultan. The disruptive and democratizing credentials of cloud computingCBOD ANR project U-PSUD
 
Migrate to Kolab, now! - How to simplify the complex task of migrating data b...
Migrate to Kolab, now! - How to simplify the complex task of migrating data b...Migrate to Kolab, now! - How to simplify the complex task of migrating data b...
Migrate to Kolab, now! - How to simplify the complex task of migrating data b...audriga.com
 
The Pandemic Changes Everything, the Need for Speed and Resiliency
The Pandemic Changes Everything, the Need for Speed and ResiliencyThe Pandemic Changes Everything, the Need for Speed and Resiliency
The Pandemic Changes Everything, the Need for Speed and ResiliencyAlluxio, Inc.
 
Efficient and reliable hybrid cloud architecture for big database
Efficient and reliable hybrid cloud architecture for big databaseEfficient and reliable hybrid cloud architecture for big database
Efficient and reliable hybrid cloud architecture for big databaseijccsa
 
New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...Derek Keats
 
Strongbox Data Storage Podcast
Strongbox Data Storage PodcastStrongbox Data Storage Podcast
Strongbox Data Storage Podcastinside-BigData.com
 
Cloud Computing in Academic Libraries A Review
Cloud Computing in Academic Libraries A ReviewCloud Computing in Academic Libraries A Review
Cloud Computing in Academic Libraries A Reviewijtsrd
 
Science for the Future: Strategies for Moving and Sharing Data
Science for the Future: Strategies for Moving and Sharing DataScience for the Future: Strategies for Moving and Sharing Data
Science for the Future: Strategies for Moving and Sharing DataIan Foster
 
A proximity aware interest-clustered p2 p file sharing system
A proximity aware interest-clustered p2 p file sharing systemA proximity aware interest-clustered p2 p file sharing system
A proximity aware interest-clustered p2 p file sharing systemLeMeniz Infotech
 
Cloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCO
Cloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCOCloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCO
Cloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCOStorage Switzerland
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamEnno Meijers
 

Was ist angesagt? (20)

INN530 - Assignment 2, Big data and cloud computing for management
INN530 - Assignment 2, Big data and cloud computing for managementINN530 - Assignment 2, Big data and cloud computing for management
INN530 - Assignment 2, Big data and cloud computing for management
 
Untitled 1
Untitled 1Untitled 1
Untitled 1
 
Libraries in the_cloud_mnlib11
Libraries in the_cloud_mnlib11Libraries in the_cloud_mnlib11
Libraries in the_cloud_mnlib11
 
CLOUD COMPUTING AND ITS APPLICATIONS IN DIGITAL LIBRARY SERVICES
CLOUD COMPUTING AND ITS APPLICATIONS IN  DIGITAL LIBRARY SERVICESCLOUD COMPUTING AND ITS APPLICATIONS IN  DIGITAL LIBRARY SERVICES
CLOUD COMPUTING AND ITS APPLICATIONS IN DIGITAL LIBRARY SERVICES
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
CLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationCLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage information
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
 
Nabil Sultan. The disruptive and democratizing credentials of cloud computing
Nabil Sultan. The disruptive and democratizing credentials of  cloud computingNabil Sultan. The disruptive and democratizing credentials of  cloud computing
Nabil Sultan. The disruptive and democratizing credentials of cloud computing
 
Migrate to Kolab, now! - How to simplify the complex task of migrating data b...
Migrate to Kolab, now! - How to simplify the complex task of migrating data b...Migrate to Kolab, now! - How to simplify the complex task of migrating data b...
Migrate to Kolab, now! - How to simplify the complex task of migrating data b...
 
The Pandemic Changes Everything, the Need for Speed and Resiliency
The Pandemic Changes Everything, the Need for Speed and ResiliencyThe Pandemic Changes Everything, the Need for Speed and Resiliency
The Pandemic Changes Everything, the Need for Speed and Resiliency
 
Efficient and reliable hybrid cloud architecture for big database
Efficient and reliable hybrid cloud architecture for big databaseEfficient and reliable hybrid cloud architecture for big database
Efficient and reliable hybrid cloud architecture for big database
 
New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...
 
Strongbox Data Storage Podcast
Strongbox Data Storage PodcastStrongbox Data Storage Podcast
Strongbox Data Storage Podcast
 
Cloud Computing in Academic Libraries A Review
Cloud Computing in Academic Libraries A ReviewCloud Computing in Academic Libraries A Review
Cloud Computing in Academic Libraries A Review
 
Science for the Future: Strategies for Moving and Sharing Data
Science for the Future: Strategies for Moving and Sharing DataScience for the Future: Strategies for Moving and Sharing Data
Science for the Future: Strategies for Moving and Sharing Data
 
Big data
Big dataBig data
Big data
 
A proximity aware interest-clustered p2 p file sharing system
A proximity aware interest-clustered p2 p file sharing systemA proximity aware interest-clustered p2 p file sharing system
A proximity aware interest-clustered p2 p file sharing system
 
Cloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCO
Cloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCOCloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCO
Cloudian Webinar - 7 Key Reasons why Object Storage lowers Storage TCO
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics Amsterdam
 

Andere mochten auch

Holy Grail of Digital Legacy Taxonomic Literature
Holy Grail of Digital Legacy Taxonomic LiteratureHoly Grail of Digital Legacy Taxonomic Literature
Holy Grail of Digital Legacy Taxonomic LiteratureChris Freeland
 
BHL Global Infrastructure - Vision
BHL Global Infrastructure - VisionBHL Global Infrastructure - Vision
BHL Global Infrastructure - VisionChris Freeland
 
Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)Chris Freeland
 
BHL / EOL technology sit down
BHL / EOL technology sit downBHL / EOL technology sit down
BHL / EOL technology sit downChris Freeland
 
Biodiversity Heritage Library & JSTOR Plant Science
Biodiversity Heritage Library & JSTOR Plant ScienceBiodiversity Heritage Library & JSTOR Plant Science
Biodiversity Heritage Library & JSTOR Plant ScienceChris Freeland
 
BHL: Technologies Supporting Scientists & Librarians
BHL: Technologies Supporting Scientists & LibrariansBHL: Technologies Supporting Scientists & Librarians
BHL: Technologies Supporting Scientists & LibrariansChris Freeland
 
WordPress Beginners Workshop
WordPress Beginners WorkshopWordPress Beginners Workshop
WordPress Beginners WorkshopThe Toolbox, Inc.
 
Drupal8 render pipeline
Drupal8 render pipelineDrupal8 render pipeline
Drupal8 render pipelineMahesh Salaria
 
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...Chris Freeland
 

Andere mochten auch (11)

Holy Grail of Digital Legacy Taxonomic Literature
Holy Grail of Digital Legacy Taxonomic LiteratureHoly Grail of Digital Legacy Taxonomic Literature
Holy Grail of Digital Legacy Taxonomic Literature
 
BHL Tech Report
BHL Tech ReportBHL Tech Report
BHL Tech Report
 
BHL Global Infrastructure - Vision
BHL Global Infrastructure - VisionBHL Global Infrastructure - Vision
BHL Global Infrastructure - Vision
 
Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)
 
BHL / EOL technology sit down
BHL / EOL technology sit downBHL / EOL technology sit down
BHL / EOL technology sit down
 
Biodiversity Heritage Library & JSTOR Plant Science
Biodiversity Heritage Library & JSTOR Plant ScienceBiodiversity Heritage Library & JSTOR Plant Science
Biodiversity Heritage Library & JSTOR Plant Science
 
BHL: Technologies Supporting Scientists & Librarians
BHL: Technologies Supporting Scientists & LibrariansBHL: Technologies Supporting Scientists & Librarians
BHL: Technologies Supporting Scientists & Librarians
 
BHL "napkin sketch"
BHL "napkin sketch"BHL "napkin sketch"
BHL "napkin sketch"
 
WordPress Beginners Workshop
WordPress Beginners WorkshopWordPress Beginners Workshop
WordPress Beginners Workshop
 
Drupal8 render pipeline
Drupal8 render pipelineDrupal8 render pipeline
Drupal8 render pipeline
 
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
 

Ähnlich wie BHL in the Cloud: A Pilot Project with DuraCloud

Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalVMware Tanzu Korea
 
Cross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage LibraryCross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage LibraryChris Freeland
 
LIBR 243 final - Caroline Helms (1)
LIBR 243 final - Caroline Helms (1)LIBR 243 final - Caroline Helms (1)
LIBR 243 final - Caroline Helms (1)Caroline Helms
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big DataDataWorks Summit
 
Traditional data word
Traditional data wordTraditional data word
Traditional data wordorcoxsm
 
Internet of Things and Hadoop
Internet of Things and HadoopInternet of Things and Hadoop
Internet of Things and Hadoopaziksa
 
Cloud Presentation and OpenStack case studies -- Harvard University
Cloud Presentation and OpenStack case studies -- Harvard UniversityCloud Presentation and OpenStack case studies -- Harvard University
Cloud Presentation and OpenStack case studies -- Harvard UniversityBarton George
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
DWH & big data architecture approaches
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approachesLuxoft
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureDenodo
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaJeffrey T. Pollock
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Josh Patterson
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Denodo
 

Ähnlich wie BHL in the Cloud: A Pilot Project with DuraCloud (20)

Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
Cross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage LibraryCross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage Library
 
Mis cloud computing
Mis cloud computingMis cloud computing
Mis cloud computing
 
LIBR 243 final - Caroline Helms (1)
LIBR 243 final - Caroline Helms (1)LIBR 243 final - Caroline Helms (1)
LIBR 243 final - Caroline Helms (1)
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big Data
 
Traditional data word
Traditional data wordTraditional data word
Traditional data word
 
Internet of Things and Hadoop
Internet of Things and HadoopInternet of Things and Hadoop
Internet of Things and Hadoop
 
Cloud Presentation and OpenStack case studies -- Harvard University
Cloud Presentation and OpenStack case studies -- Harvard UniversityCloud Presentation and OpenStack case studies -- Harvard University
Cloud Presentation and OpenStack case studies -- Harvard University
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
WTIA Cloud Computing Series - Part I: The Fundamentals
WTIA Cloud Computing Series - Part I: The FundamentalsWTIA Cloud Computing Series - Part I: The Fundamentals
WTIA Cloud Computing Series - Part I: The Fundamentals
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
DWH & big data architecture approaches
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approaches
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
 
Presentation day1oracle 12c
Presentation day1oracle 12cPresentation day1oracle 12c
Presentation day1oracle 12c
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
 

Mehr von Chris Freeland

From Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-IgoeFrom Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-IgoeChris Freeland
 
Building the Missouri Hub for DPLA
Building the Missouri Hub for DPLABuilding the Missouri Hub for DPLA
Building the Missouri Hub for DPLAChris Freeland
 
Documenting Ferguson: Building a community digital repository
Documenting Ferguson: Building a community digital repositoryDocumenting Ferguson: Building a community digital repository
Documenting Ferguson: Building a community digital repositoryChris Freeland
 
Newman Numismatic Portal Overview - Mar 2015
Newman Numismatic Portal Overview - Mar 2015Newman Numismatic Portal Overview - Mar 2015
Newman Numismatic Portal Overview - Mar 2015Chris Freeland
 
Establishing the Missouri Hub: A Service Hub for DPLA
Establishing the Missouri Hub: A Service Hub for DPLAEstablishing the Missouri Hub: A Service Hub for DPLA
Establishing the Missouri Hub: A Service Hub for DPLAChris Freeland
 
Organizing a DPLA Service Hub in Missouri
Organizing a DPLA Service Hub in MissouriOrganizing a DPLA Service Hub in Missouri
Organizing a DPLA Service Hub in MissouriChris Freeland
 
Pilots & Partnerships: University Academic Computing and University Libraries...
Pilots & Partnerships: University Academic Computing and University Libraries...Pilots & Partnerships: University Academic Computing and University Libraries...
Pilots & Partnerships: University Academic Computing and University Libraries...Chris Freeland
 
BHL: Big Data, Big Challenges
BHL: Big Data, Big ChallengesBHL: Big Data, Big Challenges
BHL: Big Data, Big ChallengesChris Freeland
 
Built Works Registry: Geocoding Biodiversity Heritage Library
Built Works Registry: Geocoding Biodiversity Heritage LibraryBuilt Works Registry: Geocoding Biodiversity Heritage Library
Built Works Registry: Geocoding Biodiversity Heritage LibraryChris Freeland
 
A Digitization Primer for Botanical and Horticultural Librarians
A Digitization Primer for Botanical and Horticultural LibrariansA Digitization Primer for Botanical and Horticultural Librarians
A Digitization Primer for Botanical and Horticultural LibrariansChris Freeland
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Chris Freeland
 
MBG Rare Book Digitization Project (2003)
MBG Rare Book Digitization Project (2003)MBG Rare Book Digitization Project (2003)
MBG Rare Book Digitization Project (2003)Chris Freeland
 
BHL: Your 24hr Library
BHL: Your 24hr LibraryBHL: Your 24hr Library
BHL: Your 24hr LibraryChris Freeland
 
BHL: Assigning DOIs & Other Identifiers to Legacy Literature
BHL: Assigning DOIs & Other Identifiers to Legacy LiteratureBHL: Assigning DOIs & Other Identifiers to Legacy Literature
BHL: Assigning DOIs & Other Identifiers to Legacy LiteratureChris Freeland
 
Life & Literature Future Framing for BHL
Life & Literature Future Framing for BHLLife & Literature Future Framing for BHL
Life & Literature Future Framing for BHLChris Freeland
 
Approaches to preserving digitized taxonomic data
Approaches to preserving digitized taxonomic dataApproaches to preserving digitized taxonomic data
Approaches to preserving digitized taxonomic dataChris Freeland
 
Scribbles & Scraps: Darwin’s Library & Annotated Literature
Scribbles & Scraps: Darwin’s Library & Annotated LiteratureScribbles & Scraps: Darwin’s Library & Annotated Literature
Scribbles & Scraps: Darwin’s Library & Annotated LiteratureChris Freeland
 
Plant Name Services Using Tropicos
Plant Name Services Using TropicosPlant Name Services Using Tropicos
Plant Name Services Using TropicosChris Freeland
 
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...Chris Freeland
 

Mehr von Chris Freeland (20)

From Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-IgoeFrom Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-Igoe
 
Building the Missouri Hub for DPLA
Building the Missouri Hub for DPLABuilding the Missouri Hub for DPLA
Building the Missouri Hub for DPLA
 
Documenting Ferguson: Building a community digital repository
Documenting Ferguson: Building a community digital repositoryDocumenting Ferguson: Building a community digital repository
Documenting Ferguson: Building a community digital repository
 
Newman Numismatic Portal Overview - Mar 2015
Newman Numismatic Portal Overview - Mar 2015Newman Numismatic Portal Overview - Mar 2015
Newman Numismatic Portal Overview - Mar 2015
 
BHL in the Cloud: A Pilot Project with DuraCloud

  • 1. BHL in the cloud A Pilot Project Tom Garnett, BHL Executive Director Chris Freeland, BHL Technical Director http://www.biodiversitylibrary.org
  • 2. The Biodiversity Heritage Library (BHL) is a global community of natural history libraries and research institutions who have formed a partnership to digitize and make available the world's biodiversity literature. Now Online: 82,000 volumes 31 million pages BHL Partners http://www.biodiversitylibrary.org
  • 3.
  • 4. BHL by the Book One 380 pg (avg) volume = multiple files, varying sizes, relationships among them PDF OCR JP2 XML > 70TB, growing every day…
  • 5. MBL: Redundant cluster MOBOT: Database & web application Internet Archive: Digitized content / files Current distributed infrastructure Content Metadata
  • 6. Access System – files, metadata & services needed to deliver content. BHL Vision: Global Infrastructure Preservation System – multiple redundant copies of all digitized content. Replicate Data Ingest Sync Data Ingest Data Ingest
  • 7. Motivation for joining pilot Community interest in cloud storage (Funding organizations, too!) Wanted to evaluate applicability of cloud storage for large-scale digitization activities Solutions for efficient transfer of 10-100s TB data Lower cost alternatives to maintaining large data centers
  • 8. BHL as a research space BHL nodes are autonomous centers serving the digitized texts under their applications in response to users. But the BHL corpus as a whole is a data set of biodiversity data in its own right. Embedded in it are: Predator/prey relationships Habitat/distribution data Host/parasite data Pathogen/disease vector data Third party researchers and projects are interested in mining the BHL texts for multiple research needs. One site for serving/accessing/downloading digital texts AND for data mining is messy. Separate out and put a version of the corpus in a public-like cloud space.
  • 9. BHL Policy Challenges Money - At present in the US, one BHL member library (MBL) is willing to provide essentially free redundant hosting. This is a very attractive financial offer. Since the MBL is a BHL member, it provides a level of administrative commitment. If this changes, DuraCloud becomes very attractive. Skill level - Multiple global partners needing all or some of the current holdings have varying levels of technical skills. For some, shipping hard drives might be easier. For some, uploading to and downloading from DuraCloud might be preferable. Timing - At the time of the closing of the pilot our partners, while very close, are not quite ready for the initial large data transfer. As they get their marbles lined up, we can evaluate DuraCloud as a transfer mechanism on a node-by-node basis. Control - In cultural-scientific digital projects there are no clear models using DuraCloud. Early-adopter paranoia.
  • 10. Data Transfer Methods & Limitations NodeB NodeA Problems: Hardware failure, data loss, shipping fees vs NodeB NodeA Problems: Available bandwidth, upload/download fees
  • 11. Data transfer: Cloud vs. Cluster Inventory & audit lists Checksums for data integrity Heavy lifting at BHL scale, regardless of endpoint weeks->months, not minutes->days Differences In cluster environment, have to be intimately involved in hardware decisions, maintenance, troubleshooting In cloud environment, those worries are part of your fee
  • 12. Challenges for adopting cloud storage BHL is embedded in longstanding institutions with megainfrastructure. Already support data storage & maintenance at BHL scale Little funding for alternative infrastructure / storage Current storage is (really, truly) free through Internet Archive Costs associated with download / use of content BHL is a global resource for a broad community User community wants to “do things” with data
  • 13. Lessons Learned Cloud infrastructure & applicability to BHL are no longer a mystery Nothing is free Except when it is Cloud storage provides ability to quickly scale infrastructure No lost time procuring & configuring hardware Useful for the right kinds of datasets It’s not the size of the corpus, it’s the size of the files Huge files are problematic
  • 14. Outcomes 10-13TB transferred from Internet Archive to DuraCloud over wire Simple, without bandwidth limitations Became intimately familiar with our data Larger files in corpus than expected (GB+ files) Issues with “checksums” Need to know your data to efficiently manage it Spent less time moving data than checking / verifying data
  • 15. Perceptions about cloud infrastructure after pilot participation More possibilities than expected: Features Movement Support available from commercial providers. Increasing menus of choices There is no silver bullet Cloud is just a different endpoint for file storage It doesn’t solve all problems related to repository management
  • 16. Future opportunities for cloud infrastructure & BHL Depending on BHL partner needs/abilities, use DuraCloud to transfer/sync files Seek research grants for data mining and include line items for DuraCloud hosting of BHL “research space” for multiple informatics projects. If “free” turns into “not so free”, use DuraCloud as ongoing redundant preservation storage. As we explore synchronization across projects, is DuraCloud an alternative?
  • 17. BHL in the cloud A Pilot Project Tom Garnett, BHL Executive Director Chris Freeland, BHL Technical Director http://www.biodiversitylibrary.org
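
Slides 11 and 14 mention inventory and audit lists, checksums for data integrity, and spending more time verifying data than moving it. The slides do not say how the audit was done, so what follows is only a minimal sketch of one way to check a transferred directory against an inventory. The function names, the inventory layout (relative file name mapped to an expected hex digest), and the default SHA-256 algorithm are all assumptions made for illustration; none of them describe the tooling used in the pilot.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so multi-GB files never sit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root, inventory, algorithm="sha256"):
    """Compare files under `root` with a hypothetical inventory.

    `inventory` maps a relative file name to its expected hex digest.
    Returns two sorted lists: names that are missing, and names whose
    checksum differs from the inventory.
    """
    root = Path(root)
    missing, mismatched = [], []
    for relative_name, expected in sorted(inventory.items()):
        candidate = root / relative_name
        if not candidate.is_file():
            missing.append(relative_name)
        elif file_checksum(candidate, algorithm) != expected.lower():
            mismatched.append(relative_name)
    return missing, mismatched
```

Reading in chunks matters here because the pilot found GB+ files in the corpus. Both endpoints would need to agree on the hash algorithm before the lists are compared, which is one plausible source of the "issues with checksums" the slides allude to.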