2. Authors
• Bruce Kozuma is a project/program manager in the Broad
Information Technology Services (BITS) department with
experience in software development, operations, and IT in
industries such as manufacturing, telecommunications,
biotechnology, and biomedical research.
• Paul Clemons is director of computational chemical biology
research in the Center for the Science of Therapeutics
(CSofT) at the Broad Institute. He and his team use
quantitative measurement, computational, and
visualization techniques to enable systematic use of small
molecules to explore biology, especially disease biology.
3. About the Broad Institute
• A collaborative community
pioneering a new model of
biomedical science; views itself as
an experiment in a new way of
doing science, empowering
researchers to:
– Act nimbly
– Work boldly
– Share openly
– Reach globally
4. Current Cell Line
Management State
• Multiple groups creating and using cell lines at the Broad, e.g.,
– Project Achilles, Profiling Relative Inhibition Simultaneously in Mixtures
(PRISM), Cancer Cell Line Encyclopedia (CCLE), Center for the Science of
Therapeutics (CSofT), Connectivity Map (CMAP), Center for the
Development of Therapeutics (CDoT)
• Some canonical sources of cell-line data at Broad, e.g.,
– Cancer Cell Line Dependencies Database (CDDB)
• However!
– Limited coordination in definitions of what constitutes a unique cell line and
how changes are made to that definition over time
– No effective mechanisms to curate, register, or search such definitions
– No automated refresh cycle for data in CDDB
5. Why is this a Problem?
• Lack of a common platform inhibits collaboration between
groups since they have to rely on external sources to know
what internal research has been done on a cell line
• When groups do collaborate, e.g., with one group supplying
cell lines and data to another, updating metadata can be
a problem, e.g., after a primary site change
• Lack of a common vocabulary leads to data quality issues,
e.g., what do you mean by "Doubling Time"?
• Velocity of scientific discovery is slower as a result
6. Practical examples
• What metadata is
tracked at what level?
• Who decides the
metadata categories and
values?
• How do we promote
project-specific
metadata to parental
cell lines?
7. Practical examples
• Who decides two or more cell lines are the same thing?
– Example: A375 and unknown cell line
– Heuristic: They are the same cell line if they have the same genomic
fingerprint and same source (e.g., individual and tissue type) –
more measurements of sameness to be added later
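The heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`fingerprint`, `individual`, `tissue_type`), and the sample values are hypothetical and do not come from the Broad's systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellLineRecord:
    """Hypothetical record; field names are assumptions for illustration."""
    name: str
    fingerprint: str   # opaque identifier for a genomic fingerprint
    individual: str    # identifier of the source individual
    tissue_type: str   # tissue the line was derived from


def same_cell_line(a: CellLineRecord, b: CellLineRecord) -> bool:
    """Apply the slide's heuristic: same fingerprint and same source."""
    same_source = (a.individual == b.individual
                   and a.tissue_type == b.tissue_type)
    return a.fingerprint == b.fingerprint and same_source


# Made-up values: an A375 record and an unlabeled line that matches it.
a375 = CellLineRecord("A375", "fp-001", "donor-1", "skin")
unknown = CellLineRecord("unknown", "fp-001", "donor-1", "skin")
print(same_cell_line(a375, unknown))
```

Further measurements of sameness, as the slide anticipates, would become additional conditions in `same_cell_line`.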
8. Desired Situation
• Common cell line metadata categories and data
• Defined, published, flexible processes for collaborative
review/approval of metadata categories and data (e.g.,
intake, change, promotion)
• Retain ability for groups to work independently on project-
specific metadata and data
• Technology that enables widespread sharing of cell-line
metadata categories and data, inside and outside Broad
10. Hypothesis: Manufacturing Practices
& Appropriate Technology Can Help
• Use best practices from manufacturing around
master data management to build necessary
organizational practices
• Use technology to enable organizational practices
• Principles:
– Technology without organizational practices is a waste
– Organizational practices without enabling, sustainable
use of technology will wither
11. Cell Line Master Data Review Board
• Establish a cell line master data review board to
review metadata categories and data
stewardship/management practices
– Draws from Material Review Boards in manufacturing
– Provides a forum for a "coalition of the willing" to come
to consensus about metadata categories before
categories/values are established, and to curate changes
before making them
– Provides institutional sponsorship, above the level of
individual projects, while being collaborative
12. Cell Line Master Data Review Board
Proposed Sponsorship/Membership
• Office of the Chief Data Officer sponsors the board to
provide cross-project arbitration
• Initial membership by organization:
– Office of the Chief Data Officer
– Office of the Chief Science Officer
– Developer of the institutional database, e.g., CSofT
– Projects creating cell lines and metadata, e.g., PRISM, Achilles
– Groups ingesting cell line metadata, e.g., Proteomics, CDoT
– BITS as facilitator (works across organization, neutral about science)
– Ad hoc members
13. Cell Line Master Data Review Board
Proposed Policies/Procedures
• Board mechanics: Governance, changes to membership, etc.
• Develop canonical source of parental cell line definition
– Assumes existing metadata categories and values can be used
• Initial methods
• Register new cell lines
• Add new metadata categories
• Add new metadata to existing
categories
• Change metadata categories or
values for existing cell lines
• Track provenance of names and
annotations (differences left to end
users to resolve)
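As a rough illustration of the initial methods listed above, the sketch below models a registry that registers cell lines, adds metadata categories, records values, and keeps the source of every claim. The class, method, and source names are hypothetical, not the board's actual procedures or any existing system. Conflicting claims are stored side by side, matching the note that differences are left to end users to resolve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    category: str
    value: str
    source: str  # who made the claim (provenance)


class CellLineRegistry:
    """Hypothetical in-memory registry; all names are assumptions."""

    def __init__(self) -> None:
        self.categories: set[str] = set()
        self.cell_lines: dict[str, list[Annotation]] = {}

    def add_category(self, category: str) -> None:
        self.categories.add(category)

    def register(self, name: str) -> None:
        if name in self.cell_lines:
            raise ValueError(f"{name} is already registered")
        self.cell_lines[name] = []

    def annotate(self, name: str, category: str, value: str,
                 source: str) -> None:
        # Only categories already added to the registry are accepted.
        if category not in self.categories:
            raise KeyError(f"unknown category: {category}")
        self.cell_lines[name].append(Annotation(category, value, source))

    def claims(self, name: str, category: str) -> list[tuple[str, str]]:
        """Return every (value, source) claim, without resolving conflicts."""
        return [(a.value, a.source) for a in self.cell_lines[name]
                if a.category == category]


registry = CellLineRegistry()
registry.add_category("primary_site")
registry.register("A375")
registry.annotate("A375", "primary_site", "skin", source="group_a")
registry.annotate("A375", "primary_site", "soft tissue", source="group_b")
print(registry.claims("A375", "primary_site"))
```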
14. Framework for Sharing Cell Line
Metadata
• Use institutional database as the canonical source of cell line
metadata
• Provide means of ingesting institutional data into local data
management systems to link project-specific data to
parental cell line data
• In the local data management system, have a common
registry of parental cell lines (available to all) and daughter
cell lines (project-specific by default)
• Preserve heredity of cell lines and allow searching by such
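One way to picture the parental/daughter registry and searching by heredity is a simple parent-pointer structure. The sketch below is an assumption-laden illustration: the class name, its methods, and the daughter line names (`A375-Cas9`, `A375-Cas9-clone1`) are hypothetical.

```python
from typing import Optional


class LineageRegistry:
    """Hypothetical registry that maps each cell line to its parent."""

    def __init__(self) -> None:
        self.parent_of: dict[str, Optional[str]] = {}

    def add_parental(self, name: str) -> None:
        self.parent_of[name] = None  # parental lines have no parent

    def add_daughter(self, name: str, parent: str) -> None:
        if parent not in self.parent_of:
            raise KeyError(f"unknown parent: {parent}")
        self.parent_of[name] = parent

    def ancestors(self, name: str) -> list[str]:
        """Walk up the lineage, nearest parent first."""
        chain = []
        current = self.parent_of[name]
        while current is not None:
            chain.append(current)
            current = self.parent_of[current]
        return chain

    def descendants(self, name: str) -> list[str]:
        """All lines that have `name` somewhere in their ancestry."""
        return sorted(c for c in self.parent_of if name in self.ancestors(c))


lineage = LineageRegistry()
lineage.add_parental("A375")
lineage.add_daughter("A375-Cas9", "A375")
lineage.add_daughter("A375-Cas9-clone1", "A375-Cas9")
print(lineage.ancestors("A375-Cas9-clone1"))
print(lineage.descendants("A375"))
```

The linear scan in `descendants` is fine for a sketch; a registry with many lines would more likely keep a child index as well.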
15. Institutional Cell Line Database
Sample Entity Relationship Diagram
• Tracks multiple names and annotations (e.g., lineage) and
the source of these claims
• Has no concept of samples or instances (annotates the
abstract entity only)
16. Data exchange via JavaScript Object Notation (JSON) file:
cell_sample = {
  cell_sample_names: [
    {cell_name_type: "CCLE",
     cell_sample_name: "A375_SKIN"},
    {cell_name_type: "cddb",
     cell_sample_name: "30"},
    {cell_name_type: "ATCC",
     cell_sample_name: "A-375 [A375] (ATCC® CRL-1619™)"}]
}
Institutional Cell Line Database
Sample Data Exchange Mechanism
• cell_sample_names: array of names
for a cell line
• cell_name_type: namespace for a cell
line name, e.g., CCLE, CDDB, ATCC
• cell_sample_name: name for a cell line
and internal priority of that name,
e.g., may prefer one name to another
name; for example:
– CCLE: A375_SKIN
– CDDB: 30
– ATCC: A-375 [A375] (ATCC® CRL-1619™)
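A consuming system might read such an exchange record as follows. This is a minimal sketch, assuming the file is strict JSON (quoted keys, unlike the object-literal notation on the slide); the helper `names_by_type` and the choice to upper-case the namespace are assumptions for illustration.

```python
import json

# The slide's record written as strict JSON (keys quoted).
RAW = """
{"cell_sample_names": [
  {"cell_name_type": "CCLE", "cell_sample_name": "A375_SKIN"},
  {"cell_name_type": "cddb", "cell_sample_name": "30"},
  {"cell_name_type": "ATCC",
   "cell_sample_name": "A-375 [A375] (ATCC® CRL-1619™)"}
]}
"""


def names_by_type(record: dict) -> dict[str, str]:
    """Map each namespace to the cell line's name in that namespace.

    Namespaces are upper-cased so "cddb" and "CDDB" are treated alike
    (an assumption, not something the source specifies).
    """
    return {entry["cell_name_type"].upper(): entry["cell_sample_name"]
            for entry in record["cell_sample_names"]}


cell_sample = json.loads(RAW)
names = names_by_type(cell_sample)
print(names["CCLE"])
print(names["CDDB"])
```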
17. Local Data Management System:
Laboratory Data Management (LDM)
• Project for BITS to provide centrally managed/supported
solutions for management of laboratory data, divided into
functions:
– Data capture/archive (instruments and other sources)
– Container inventory/registration (chemical,
biological, hybrid)/sample management
– Core Electronic Laboratory Notebook (ELN, experiment
documentation/IP protection/linking to data)
– Data/workflow management
– Data analysis/visualization
20. Next Steps for Sharing Cell Line
Metadata
• Work out data privacy/classification restrictions
• Phased implementation for sharing data from the institutional cell
line database with external systems like ArxLab
– Phase 1: Import static list (e.g., JSON file) of parental cell lines (~117K) and
synonyms into ArxLab Registration with type-ahead to autocomplete
names, e.g., A37 shows A375
– Phase 2: Add resolution of entered names to a common cell line ID and
preferred name, e.g., entering A375_SKIN resolves to A375 upon entry
– Phase 3: Automatically update LDM via periodic push from institutional cell line
database, including setting up legal framework for data distribution
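The behavior described for Phases 1 and 2 can be sketched independently of ArxLab. The code below is not ArxLab's API; it is a hypothetical illustration of type-ahead over a static synonym list and of resolving an entered name to a preferred name. The function names and the second sample entry (A549) are assumptions added for the example.

```python
def build_index(synonyms: dict[str, list[str]]) -> dict[str, str]:
    """Map every known name (upper-cased) to its preferred name."""
    index = {}
    for preferred, alternates in synonyms.items():
        for name in [preferred, *alternates]:
            index[name.upper()] = preferred
    return index


def type_ahead(index: dict[str, str], prefix: str,
               limit: int = 10) -> list[str]:
    """Phase 1: preferred names with any known name starting with prefix."""
    wanted = prefix.strip().upper()
    matches = {preferred for name, preferred in index.items()
               if name.startswith(wanted)}
    return sorted(matches)[:limit]


def resolve(index: dict[str, str], entered: str):
    """Phase 2: resolve an entered name to the preferred name, or None."""
    return index.get(entered.strip().upper())


# Illustrative static list; a real import would come from the JSON file.
index = build_index({"A375": ["A375_SKIN"], "A549": ["A549_LUNG"]})
print(type_ahead(index, "A37"))
print(resolve(index, "A375_SKIN"))
```

With roughly 117K parental lines plus synonyms, a linear prefix scan would probably be replaced by a sorted list with binary search or a trie, but the lookup contract stays the same.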
21. Acknowledgements
Achilles
Francesca Vazquez
Sasha Pantel
Nicole Dabkowski
Phil Montgomery
Glenn Cowley
PRISM
Chris Mader
Jen Roth
Sam Bender
Massami Laird
Ed McBride
CDDB Data Curation
Paul Clemons
Mahmoud Ghandi
Shuba Gopal
Gregory Gydush
Barbara Weir
Broad Management
Alex Burgin
Anthony Philippakis
Scott Sutherland
Broad Information
Technology Services
Chris Dwan
Eric Jones
Arxspan
Jeff Carter
Kate Hardy
23. Summary – Background
• One of the key challenges in conducting research in a diverse and dynamic
organization like the Broad Institute is connecting islands of related data
• Since scientific groups have traditionally been separated from each other,
relying on each other as internal suppliers and customers, their data have
similarly been separated; it is not uncommon to have two groups working on the
same cell line but with no means of finding out about each other's work, partially
due to different means of tracking cell-line data
• The Broad Institute has collaborated with Arxspan to develop a configuration of
ArxLab to share a common registry of parental cell lines, allowing different
groups to have a common vocabulary about cell lines and opening
collaboration possibilities for both new science and accelerated progress on
existing science
24. What You Can Gain – Background
• Gain insight into how the Broad solved a common and intractable
issue facing a variety of diverse organizations using cloud-based,
current-generation laboratory data-management software in a manner
that can be reapplied in a variety of situations
• See how different departments within the Broad worked
collaboratively with Arxspan to solve this issue in a horizontal manner,
i.e., differently from either a bottom up or top down approach
• See how existing technology can be extended in demanding
scientific environments to solve long-standing collaboration issues
within a leading biomedical research organization
Editor's notes
More about Bruce at LinkedIn: https://www.linkedin.com/in/bkozuma
More about Paul at LinkedIn: https://www.linkedin.com/in/pclemons
More about Paul at the Broad Institute: http://www.broadinstitute.org/scientific-community/science/programs/csoft/chemical-biology/paul-clemons
More about the Broad Institute of MIT and Harvard: http://www.broadinstitute.org
More about Achilles: http://www.broadinstitute.org/Achilles
More about PRISM: https://www.broadinstitute.org/software/cprg/?q=node/67
More about CCLE: http://www.broadinstitute.org/ccle
More about CSofT: http://www.broadinstitute.org/scientific-community/science/programs/csoft/center-science-therapeutics
More about CMAP: http://www.broadinstitute.org/cmap
More about CDoT: