The document discusses the Neuroscience Information Framework (NIF), an initiative by the NIH Blueprint to provide a single access point for searching across multiple neuroscience databases and data types. NIF aims to maximize access to and utility of worldwide neuroscience resources by creating a consistent framework for describing resources and enabling simultaneous searches. It notes that neuroscience data exists in many forms, from raw data to processed data to claims, across multiple scales and data types. NIF is designed to rapidly integrate these diverse resources through a tiered system that has a low barrier for data providers to participate.
Neuroscience Data Needs Multiple Databases and Formats
1. Maryann E. Martone, Ph.D., University of California, San Diego
2. Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
• Whole brain data (20 um microscopic MRI)
• Mosaic LM images (1 GB+)
• Conventional LM images
• Individual cell morphologies
• EM volumes & reconstructions
• Solved molecular structures
No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases
5. • NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about current data practices
The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes
http://neuinfo.org
6. http://neuinfo.org
June 10, 2013, dkCOIN Investigator's Retreat
• A portal for finding and using neuroscience resources
• A consistent framework for describing resources
• Provides simultaneous search of multiple types of information, organized by category
• Supported by an expansive ontology for neuroscience
• Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature; Database Federation; Registry
7. We’d like to be able to find:
• What is known:
– What are the projections of hippocampus?
– Is GRM1 expressed in cerebral cortex?
– What genes have been found to be upregulated in chronic drug abuse in adults?
– What animal models have similar phenotypes to Parkinson’s disease?
– What studies used my polyclonal antibody against GABA in humans?
• What is not known:
– Connections among data
– Gaps in knowledge
A framework makes it easier to address these questions
8.
9. With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
10. • NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry
Requires no special skills
Site map available for local hosting
• NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr
Two-tiered system: low barrier to entry
11. Current / Planned DISCO Dashboard Functions
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
12. NIF was designed to be populated rapidly with progressive refinement
13. Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
– E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
14. Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using full resource when getting there from NIF
Link back to record in original source
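The query expansion shown on this slide can be sketched roughly as follows. This is a minimal illustration, not NIF's implementation: the `SYNONYMS` table and the `quote` and `expand_query` helpers are hypothetical names, and a real system would draw synonyms from an ontology rather than a hard-coded dictionary.

```python
from typing import Dict, List

# Hypothetical synonym table (assumption); the entries mirror the slide's example.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str) -> str:
    """Return the term OR-ed together with any known synonyms."""
    synonyms = SYNONYMS.get(term.lower(), [])
    return " OR ".join(quote(t) for t in [term, *synonyms])


if __name__ == "__main__":
    print(expand_query("Hippocampus"))
```

A term with no known synonyms is returned unchanged, so the expansion step can be applied to every query without special-casing.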
15. Connects to; Synapsed with; Synapsed by; Input region; Innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site
Each resource implements a different, though related, model; systems are complex and difficult to learn, in many cases
16.
17. • You (and the machine) have to be able to find it
– Accessible through the web
– Structured or semi-structured
– Annotations
• You (and the machine) have to be able to use it
– Data type specified and in an actionable form
• You (and the machine) have to know what the data mean
– Semantics
– Context: experimental metadata
– Provenance: where did they come from
18. Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
19. Purkinje Cell; Axon Terminal; Axon; Dendritic Tree; Dendritic Spine; Dendrite; Cell body; Cerebellar cortex
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
20. • NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure
21. Brain (has a) Cerebellum (has a) Purkinje Cell Layer (has a) Purkinje cell (is a) neuron
• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
• Branch of philosophy: a theory of what is
• e.g., Gene ontologies
22. • Express neuroscience concepts in a way that is machine readable
– Synonyms, lexical variants
– Definitions
• Provide means of disambiguation of strings
– Nucleus part of cell; nucleus part of brain; nucleus part of atom
• Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
• Properties
– Support reasoning
• Provide universals for navigating across different data sources
– Semantic “index”
– Link data through relationships, not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis
Mathematics, computer code or Esperanto
23. birnlex_1732; Brodmann.1
Explicit mapping of database content helps disambiguate non-unique and custom terminology
24. Aligns sources to the NIF semantic framework
25.
26. • Search Google: GABAergic neuron
• Search NIF: GABAergic neuron
– NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
Search by meaning not by string
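One way to picture "search by meaning" is to expand a concept into all of its subclasses before searching. The sketch below assumes a tiny hand-written is-a table (`IS_A`) and hypothetical helper names (`descendants`, `search_terms`); it is not how NIF is implemented, only an illustration of the idea.

```python
from typing import Dict, List, Set

# Toy is-a hierarchy, child -> parent (assumption; illustrative entries only).
IS_A: Dict[str, str] = {
    "Purkinje cell": "GABAergic neuron",
    "Basket cell": "GABAergic neuron",
    "GABAergic neuron": "Neuron",
    "Pyramidal cell": "Neuron",
}


def descendants(concept: str) -> Set[str]:
    """Collect every class that is a direct or indirect subclass of concept."""
    found: Set[str] = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, parent in IS_A.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def search_terms(concept: str) -> List[str]:
    """Return the concept plus all of its subclasses, sorted for stable output."""
    return sorted({concept} | descendants(concept))


if __name__ == "__main__":
    print(search_terms("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would miss records that mention only "Purkinje cell"; the expanded term list catches them.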
27. Equivalence classes; restrictions
Arbitrary but defensible
• Neurons classified by
– Circuit role: principal neuron vs interneuron
– Molecular constituent: Parvalbumin-neurons, calbindin-neurons
– Brain region: Cerebellar neuron
– Morphology: Spiny neuron
• Molecule Roles: Drug of abuse, anterograde tracer, retrograde tracer
• Brain parts: Circumventricular organ
• Organisms: Non-human primate, non-human vertebrate
• Qualities: Expression level
• Techniques: Neuroimaging
28. What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)
Morphine; Increased expression; Adult Mouse
29. • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal-lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (Human fMRI)
– Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
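The three match categories counted on this slide (exact, synonym, first-order partonomy) could be distinguished along these lines. This is a rough sketch under stated assumptions: the `SYNONYM_OF` and `PART_OF` tables are toy data, `canonical` and `match_type` are hypothetical names, and "exact" is taken here to mean equal after trimming and lowercasing, which may differ from the criterion used in the original analysis.

```python
from typing import Dict, Optional

# Toy lookup tables (assumptions for illustration only).
SYNONYM_OF: Dict[str, str] = {  # variant -> preferred term
    "cornu ammonis": "hippocampus",
    "ammon's horn": "hippocampus",
}
PART_OF: Dict[str, str] = {  # part -> direct parent structure
    "ca1": "hippocampus",
}


def canonical(term: str) -> str:
    """Normalize case and whitespace, then map a synonym to its preferred term."""
    key = term.strip().lower()
    return SYNONYM_OF.get(key, key)


def match_type(a: str, b: str) -> Optional[str]:
    """Classify how two database terms relate, or return None if they do not."""
    if a.strip().lower() == b.strip().lower():
        return "exact"
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "synonym"
    if PART_OF.get(ca) == cb or PART_OF.get(cb) == ca:
        return "partonomy"  # one term is a direct part of the other
    return None


if __name__ == "__main__":
    for pair in [("Hippocampus", "hippocampus"), ("Ammon's horn", "Hippocampus"),
                 ("CA1", "Hippocampus"), ("CA1", "Striatum")]:
        print(pair, match_type(*pair))
```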
30. http://neurolex.org
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
– Anyone can contribute their terms, concepts, things
– Anyone can edit
– Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
• International Neuroinformatics Coordinating Facility
Demo D03
Larson et al., Frontiers in Neuroinformatics, in press
31. • Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data
• Implemented forms for certain types of entities
• Neuroscience knowledge in the web
Pages are linked through properties; knowledge base built through cross-modular relations and links to data; red links
32. • > 1000 DICOM Terms
– Karl Helmer
– Data Sharing Task Force
• Tasks and Cognitive Concepts from Cognitive Atlas
– Russ Poldrack
• > 280 Neurons
– Gordon Shepherd and 30 worldwide experts
• ~500 fly neurons from Fly Anatomy Ontology
– David Osumi-Sutherland
• > 1200 Brain parcellations
~20,000 concepts: spreadsheet downloads, through NIF Web Services, SPARQL endpoint
200,000 edits
150 contributors
34. Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
35. 1. Look brain region up in NeuroLex
2. Look up cells contained in the brain region
3. Find those cells that are known to project out of that brain region
4. Look up the neurotransmitters for those cells
5. Determine whether those neurotransmitters are known to be excitatory or inhibitory
6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where they were made
7. Make sure user can get back to each statement in the logic chain to edit it if they think it is wrong
Stephen Larson
CHEBI:18243
Are projections from the VTA excitatory or inhibitory?
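The numbered steps above can be sketched as a small lookup chain. Everything in this example is an assumption made for illustration: the region and cell names are placeholders, the four dictionaries stand in for NeuroLex wiki pages, and `classify_projections` is a hypothetical name rather than part of any NIF or NeuroLex API. The point is only that each conclusion carries the chain of statements that produced it.

```python
from typing import Dict, List

# Toy knowledge base (assumption); placeholders stand in for wiki content.
CELLS_IN_REGION: Dict[str, List[str]] = {"Region A": ["Cell X", "Cell Y"]}
PROJECTS_OUT: Dict[str, bool] = {"Cell X": True, "Cell Y": False}
TRANSMITTER: Dict[str, str] = {"Cell X": "GABA", "Cell Y": "Glutamate"}
EFFECT: Dict[str, str] = {"GABA": "inhibitory", "Glutamate": "excitatory"}


def classify_projections(region: str) -> List[dict]:
    """Follow steps 1 to 6: region -> cells -> projecting cells -> transmitter -> effect."""
    results = []
    for cell in CELLS_IN_REGION.get(region, []):        # steps 1 and 2
        if not PROJECTS_OUT.get(cell, False):            # step 3
            continue
        transmitter = TRANSMITTER.get(cell, "unknown")   # step 4
        effect = EFFECT.get(transmitter, "unknown")      # step 5
        chain = [                                        # step 6: keep the reasoning
            f"{region} contains {cell}",
            f"{cell} projects out of {region}",
            f"{cell} releases {transmitter}",
            f"{transmitter} is {effect}",
        ]
        results.append({"cell": cell, "effect": effect, "chain": chain})
    return results


if __name__ == "__main__":
    for result in classify_projections("Region A"):
        print(result["cell"], "->", result["effect"])
        for statement in result["chain"]:
            print("   ", statement)
```

Keeping the chain as a list of individual statements is what would let a user trace a wrong answer back to the single statement that needs editing, as step 7 asks.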
36. • INCF Project
– Neuron Registry
– > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
– Led by Dr. Gordon Shepherd
[Chart: number of total redlinks, easy fixes and hard fixes for soma location, dendrite location and axon location; axis 0 to 300]
Social networks and community sites let us learn things from the collective behavior of contributors
37. neurolex.org: Semantic Wiki
• INCF Community encyclopedia
• Define all vocabulary, terms, protocols, brain structures, diseases, etc.
• Living review articles
• Links to data, models and literature
• Semantic organization, search, analysis and integration
• Searchable via the web
• Global directory of all shared vocabularies, CDEs, etc.
Slide courtesy of Sean Hill: International Neuroinformatics Coordinating Facility
39. • NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
NIF is trying to make it easier to work with diverse data
40. NIF is in a unique position to answer questions about the neuroscience landscape: Kepler Workflow engine + NIF semantics
Where are the data?
Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex; Brain
Brain region; Data source
41. ∞
What is easily machine processable and accessible
What is potentially knowable
What is known: literature, images, human knowledge
Unstructured; natural language processing, entity recognition, image processing and analysis; paywalls communication
Abstracts vs full text vs tables etc.
42. Closed world vs open world
We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
But... NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records
43. Neocortex; Olfactory bulb; Neostriatum; Cochlear nucleus
All neurons with cell bodies in the same brain region are grouped together
Properties in Neurolex
44. Exposing knowledge gaps and biases
Where are the data?
Striatum; Hypothalamus; Olfactory bulb; Cerebral cortex; Brain
Brain region; Data source; Funding
45. • Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
– 1370 statements from Gemma regarding gene expression as a function of chronic morphine
– 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
– Results for 1 gene were opposite in DRG and Gemma
– 45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make life easier
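The baseline mismatch described on this slide is the kind of thing a simple normalization step can handle. The sketch below is hypothetical: the gene names are placeholders, `normalize` and `compare` are illustrative helpers, and flipping the direction is only valid under the assumption that the two sources compare the same two conditions (saline and chronic morphine) with the roles of baseline and treatment swapped.

```python
from typing import Dict

FLIP = {"up": "down", "down": "up"}


def normalize(direction: str, baseline: str, reference: str = "saline") -> str:
    """Re-express a direction of change relative to the reference baseline.

    Assumes a two-condition comparison, so a different baseline means the
    direction is simply reversed.
    """
    return direction if baseline == reference else FLIP[direction]


def compare(gemma: Dict[str, str], drg: Dict[str, str]) -> Dict[str, str]:
    """Compare per-gene directions after putting both on the saline baseline.

    gemma: directions reported relative to chronic morphine.
    drg:   directions reported relative to saline.
    """
    outcome = {}
    for gene, direction in gemma.items():
        if gene not in drg:
            outcome[gene] = "not found"
        elif normalize(direction, "chronic morphine") == drg[gene]:
            outcome[gene] = "consistent"
        else:
            outcome[gene] = "opposite"
    return outcome


if __name__ == "__main__":
    gemma = {"gene_a": "up", "gene_b": "down", "gene_c": "up"}  # placeholder genes
    drg = {"gene_a": "down", "gene_b": "down"}
    print(compare(gemma, drg))
```

Recording the baseline alongside each result is the sort of relatively simple standard that would make this reconciliation automatic.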
46. NIF favors a hybrid, tiered, federated system
• Domain knowledge
– Ontologies
• Claims, models and observations
– Virtuoso RDF triples
– Model repositories
• Data
– Data federation
– Spatial data
– Workflows
• Narrative
– Full text access
Neuron; Brain part; Disease; Organism; Gene
Caudate projects to Snpc
Grm1 is upregulated in chronic cocaine
Betz cells degenerate in ALS
NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
Technique; People
47. Scholar; Library; Scholar; Publisher
FORCE11.org: Future of research communications and e-scholarship
48. Scholar; Consumer; Libraries; Data Repositories; Code Repositories; Community databases/platforms; OA; Curators; Social Networks; Social Networks; Social Networks; Peer Reviewers
Narrative; Workflows; Data; Models; Multimedia; Nanopublications; Code
49. • Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
– >Basic information models/semantic models exist for certain types of entities
Semantic frameworks create spaces in which to compare the current state of data and knowledge
50. • Several powerful trends should change the way we think about our data: One, Many
– Many data
• Generation of data is getting easier; shared data
• Data space is getting richer: more –omes every day
• But... compared to the biological space, still sparse
– Many resources: everyone wants to be “the” one, but e pluribus unum
– Many eyes
• Wisdom of crowds
• More than one way to interpret data
– Many algorithms
• Not a single way to analyze data
– Many analytics
• “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
New works need to be created with an eye towards the web and interoperability
51. Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
53. 47/50 major preclinical published cancer studies could not be replicated
• “The scientific community assumes that the claims in a preclinical study can be taken at face value-that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
• Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data
Begley and Ellis, 29 March 2012 | Vol 483 | Nature | 531
54. • Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
– If the market can support 11 MRI databases, fine
– Some consolidation, coordination is usually warranted
• Big, broad and messy beats small, narrow and neat
– Without trying to integrate a lot of data, we will not know what needs to be done
– Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic
– A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
– No source, not even NIF, is THE source; we are all a source
– Think about interoperation from the inception
55.
56. Regional part of nervous system; Parcellation scheme parcel; Parcellation scheme parcel; Single species or strain; Parcellation scheme; Precise definition; Technique
INCF Task Force: Alan Rutenberg, Seth Ruffins
Functional part of nervous system; Partially overlaps; Taxon rank; General hierarchy
57. 1200 parts of nervous system characterized (mostly) according to CUMBO terms
1200 “parcels” from individual atlases/papers
700 neurons; 280 via Neuron Registry
Available via NIF vocabulary services (REST)
Hosted in a Virtuoso triple store via SPARQL