Hinges and Loops? -- Data as Evidence
        I-School UC, Berkeley
         November 13, 2009
  “Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.”   http://en.wikipedia.org/wiki/Cavendish_experiment
“Othello:
   ‘Villain, be sure thou prove my love a whore;
   Be sure of it; give me the ocular proof;
   Or, by the worth of man’s eternal soul,
   Thou hadst been better have been born a dog
   Than answer my waked wrath!’
Iago:
   ‘Is’t come to this?’
Othello:
   ‘Make me to see’t; or at the least so prove it,
   That the probation bear no hinge nor loop
   To hang a doubt on; or woe upon thy life!’ ”

                   The Tragedy of Othello: The Moor of Venice (Act 3 Scene 3)
“So the universe has always appeared to the natural mind as a kind of enigma,
of which the key must be sought in the shape of some illuminating or
power-bringing word or name. That word names the universe's principle, and to
possess it is, after a fashion, to possess the universe itself. 'God,'
'Matter,' 'Reason,' 'the Absolute,' 'Energy,' are so many solving names. You
can rest when you have them. You are at the end of your metaphysical quest.”




       William James. "What Pragmatism Means". Lecture 2 in Pragmatism: A new name for some old ways of
                           thinking. New York: Longman Green and Co (1922): 52-52.
                          http://www.archive.org/stream/pragmatismnewnam00jame
Internet Archive:
http://www.archive.org/stream/pragmatismnewnam00jame

Note Date of Publication: 1922
Clear definitions are good (!)
We should not reflexively rely on metaphysical
  “solving” / “power-bringing” words…
ADD to James’s list?:
“Knowledge”
“Information”
“Data” ???
“Data” ?
Usage
Data: The word data is the Latin plural of datum, neuter past participle
  of dare, "to give", hence "something given".

“ Data leads a life of its own quite independent of datum, of which it
   was originally the plural. It occurs in two constructions: as a plural
   noun (like earnings), taking a plural verb and plural modifiers (as
   these, many, a few) but not cardinal numbers, and serving as a
   referent for plural pronouns; and as an abstract mass noun (like
   information), taking a singular verb and singular modifiers (as this,
   much, little), and being referred to by a singular pronoun. Both
   constructions are standard. The plural construction is more common
   in print, perhaps because the house style of some publishers
   mandates it.”
                      The Merriam-Webster Online Dictionary
                http://www.merriam-webster.com/dictionary/data
“Data” ? [technological]
“…’data’ are defined as any information that can be stored in
  digital form and accessed electronically, including, but not
  limited to, numeric data, text, publications, sensor streams,
  video, audio, algorithms, software, models and simulations,
  images, etc.” -- Program Solicitation 07-601
   “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”



Taken in this broadest possible sense, “data” are thus simply
  electronic coded forms of information. And virtually anything
  can be represented as “data” so long as it is electronically
  machine-readable.
“The digital universe in 2007 — at 2.25 x 10^21 bits (281
       exabytes or 281 billion gigabytes) — was 10% bigger than we
       thought. The resizing comes as a result of faster growth in
       cameras, digital TV shipments, and better understanding of
       information replication.
    “By 2011, the digital universe will be 10 times the size it was in
       2006.
    “As forecast, the amount of information created, captured, or
       replicated exceeded available storage for the first time in
       2007. Not all information created and transmitted gets
       stored, but by 2011, almost half of the digital universe will not
       have a permanent home.
    “Fast-growing corners of the digital universe include those
       related to digital TV, surveillance cameras, Internet access in
       emerging countries, sensor-based applications, datacenters
       supporting “cloud computing,” and social networks.
The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive
Summary. IDC Information and Data, March, 2008
http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
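
The headline figures in the quotation above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) prefixes and 8 bits per byte, neither of which the IDC summary states, and the variable names are illustrative.

```python
# Assumptions (not stated in the IDC summary): SI prefixes and 8-bit bytes.
EXA = 10**18
GIGA = 10**9
BITS_PER_BYTE = 8

size_bytes = 281 * EXA                   # "281 exabytes"
size_bits = size_bytes * BITS_PER_BYTE   # to compare with the quoted bit count
size_gigabytes = size_bytes // GIGA      # to compare with "281 billion gigabytes"

print(f"{size_bits:.3e} bits")
print(f"{size_gigabytes:,} gigabytes")
```

Under those assumptions 281 exabytes comes to 2.248 x 10^21 bits, which rounds to the 2.25 x 10^21 given in the summary, and to exactly 281 billion gigabytes.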
“The diversity of the digital universe can be seen in
  the variability of file sizes, from 6 gigabyte
  movies on DVD to 128-bit signals from RFID tags.
  Because of the growth of VoIP, sensors, and RFID,
  the number of electronic information
  “containers” — files, images, packets, tag
  contents — is growing 50% faster than the
  number of gigabytes. The information created in
  2011 will be contained in more than 20
  quadrillion — 20 million billion — of such
  containers, a tremendous management
  challenge for both businesses and consumers.”
The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information
Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008
http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
“Data” [epistemic]
“Measurements, observations or descriptions of
 a referent -- such as an individual, an event, a
 specimen in a collection or an
 excavated/surveyed object -- created or
 collected through human interpretation
 (whether directly “by hand” or through the
 use of technologies)”
                 -- AnthroDPA Working Group on Metadata (May, 2009)
“The General Definition of Information (GDI)”

σ is an instance of information, understood as
  semantic content, if and only if:
•     (GDI.1) σ consists of one or more data;
•          (GDI.2) the data in σ are well-formed;
•          (GDI.3) the well-formed data in σ are meaningful.

    Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information”
              (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy
              http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
“…with the corollary assumptions that they are
    objective -- that is, not conditioned by
    subjective perspectives
  and
  invariant – that is, true under all circumstances.”
                                              -- Draft GBIF DPFTG Report, 2009


SEE: R. Nozick, Invariances: The Structure of the Objective World, Harvard
University Press, Cambridge, 2001. AND L. Daston and P. Galison, Objectivity,
Zone Books, NY, 2007.
The Diaphoric Definition of Data (DDD):

“According to GDI, information cannot be dataless but, in the simplest case, it can consist of a single
    datum. Now a datum is reducible to just a lack of uniformity (diaphora is the Greek word for
    “difference”), so a general definition of a datum is:
  The Diaphoric Definition of Data (DDD): A datum is a putative fact regarding some difference or
    lack of uniformity within some context.
“Depending on philosophical inclinations, DDD can be applied at three levels:
 1. data as diaphora de re, that is, as lacks of uniformity in the real world out there. There is no
    specific name for such “data in the wild”. A possible suggestion is to refer to them as dedomena
    (“data” in Greek; note that our word “data” comes from the Latin translation of a work by
    Euclid entitled Dedomena). Dedomena are not to be confused with environmental data (see
    section 1.7.1). They are pure data or proto-epistemic data, that is, data before they are
    epistemically interpreted. As “fractures in the fabric of being” they can only be posited as an
    external anchor of our information, for dedomena are never accessed or elaborated
    independently of a level of abstraction (more on this in section 3.2.2). They can be
    reconstructed as ontological requirements, like Kant's noumena or Locke's substance: they are
    not epistemically experienced but their presence is empirically inferred from (and required by)
    experience. Of course, no example can be provided, but dedomena are whatever lack of
    uniformity in the world is the source of (what looks to information systems like us as) data,
    e.g., a red light against a dark background. Note that the point here is not to argue for the
    existence of such pure data in the wild, but to provide a distinction that (in section 1.6) will help
    to clarify why some philosophers have been able to accept the thesis that there can be no
    information without data representation while rejecting the thesis that information requires
    physical implementation; …”
The Diaphoric Definition of Data (DDD): (cont.)


  “2. data as diaphora de signo, that is, lacks of uniformity between (the perception of) at least two
       physical states, such as a higher or lower charge in a battery, a variable electrical signal in a
       telephone conversation, or the dot and the line in the Morse alphabet; and
  3. data as diaphora de dicto, that is, lacks of uniformity between two symbols, for example the
       letters A and B in the Latin alphabet.”




Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information”
(First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy
http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
“Evidence”?



    “Data having probative value and authority”?
i.e. well supported by scientific logic and considered
                  technically valid
Policy Formation
 and Decision Making
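Political Power and Knowledge
[Figure: actors plotted by responsibility and power (vertical axis, low to high) against knowledge in Western scientific terms (horizontal axis, low to high). Politicians sit highest on responsibility and power and lowest on knowledge, followed by administrators or managers, then analysts and technical staff, with scientists highest on knowledge and lowest on power. The corner that is high on both axes is marked “???”.] (Sutton, 1999)

From: Organizaciones que aprenden, paises que aprenden: lecciones y AP en Costa Rica by Andrea Ballestero, Director, ELAP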
Wednesday, January 21st, 2009 at 12:00 am
MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND
AGENCIES
SUBJECT:    Freedom of Information Act

A democracy requires accountability, and accountability requires transparency. As Justice Louis
Brandeis wrote, "sunlight is said to be the best of disinfectants." In our democracy, the Freedom
of Information Act (FOIA), which encourages accountability through transparency, is the most
prominent expression of a profound national commitment to ensuring an open Government. At the
heart of that commitment is the idea that accountability is in the interest of the Government and
the citizenry alike.

The Freedom of Information Act should be administered with a clear presumption: In the face of
doubt, openness prevails. The Government should not keep information confidential merely
because public officials might be embarrassed by disclosure, because errors and failures might
be revealed, or because of speculative or abstract fears. Nondisclosure should never be based on
an effort to protect the personal interests of Government officials at the expense of those they are
supposed to serve. In responding to requests under the FOIA, executive branch agencies
(agencies) should act promptly and in a spirit of cooperation, recognizing that such agencies are
servants of the public.

All agencies should adopt a presumption in favor of disclosure, in order to renew their
commitment to the principles embodied in FOIA, and to usher in a new era of open Government.
The presumption of disclosure should be applied to all decisions involving FOIA…[clip]

Barack Obama

                            http://www.whitehouse.gov/the_press_office/Freedom_of_Information_Act/
“Declaration of Scientific Principles”
        in “The Commonwealth of Science”
“7. The pursuit of scientific inquiry demands
  complete intellectual freedom. And
  unrestricted international exchange of
  knowledge…“


      from “The Commonwealth of Science ” Nature No.3753
      October 4, 1941.
August 4, 2009: the White House issued a
  memorandum stating unequivocally “Sound
  science should inform policy decisions”




“Science and Technology Priorities for the FY2011 Budget,” PR Orszag and
JP Holdren August 4, 2009, Memorandum for the Heads of Executive
Departments and Agencies, M-09-27.
http://www.whitehouse.gov/omb/assets/memoranda_fy2009/m09-27.pdf
The $3.6 billion Large Hadron Collider
(LHC) will sample and record the
results of up to 600 million proton
collisions per second, producing
roughly 15 petabytes (15 million
gigabytes) of data annually in search of
new fundamental particles. To allow
thousands of scientists from around the
globe to collaborate on the analysis of
these data over the next 15 years (the
estimated lifetime of the LHC), tens of
thousands of computers located around
the world are being harnessed in a
distributed computing network called
the Grid. Within the Grid, described as
the most powerful supercomputer
system in the world, the avalanche of
data will be analyzed, shared, re-
purposed and combined in innovative
new ways designed to reveal the
secrets of the fundamental properties
of matter.

LHC source:
http://public.web.cern.ch/public/en/LHC/L

Source:
http://public.web.cern.ch/Public/en/LHC/L
“The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008
http://www.lanl.gov/news/1663/imag
“The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los
                Alamos Science and Technology Magazine August 2008
                http://www.lanl.gov/news/1663/images/aug08/22lg.jpg
The (US) NCAR
            Research Data Archive (RDA)
    “The NCAR Research Data Archive (RDA) is a comparatively small
       (currently 246 TB, less than 5% of the MSS [Mass Storage System] total
       size), but very important, part of the MSS stored data. The RDA has
       been curated by the staff in the Computational and Information
       Systems Laboratory for over 40 years, [emphasis added] and as such
       contains reference datasets used by large numbers of scientists.
       The RDA contents are long-term atmospheric (surface and upper
       air) and oceanographic observations, grid analyses of observational
       datasets, operational weather prediction model output, reanalyses,
       satellite derived datasets, and ancillary datasets, such as
       topography/bathymetry, vegetation, and land use. The RDA is not
       a static collection; it is now over 580 datasets with about 100
       routinely updated and 10-20 new ones added each year. “


C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge
                        sharing ,” from the 4th International Digital Curation Conference December 2008, page 5.
       www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
NCAR Research Data Archive (RDA)




C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge
                        sharing ,” from the 4th International Digital Curation Conference December 2008 , page 7.
      www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
http://www.ncdc.noaa.gov/img/climate/globalwarming/ar4-fig-3-9.gif
Facebook?
Facebook, for example, uses more than 1
  petabyte of storage space to manage its users’
  40 billion photos. (A petabyte is about 1,000
  times as large as a terabyte, and could store
  about 500 billion pages of text.)

  Training to Climb an Everest of Digital Data
  By ASHLEE VANCE NYT Published: October 11,
  2009


    http://www.nytimes.com/2009/10/12/technology/12data.html?_r=1
“Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.”   http://en.wikipedia.org/wiki/Cavendish_experiment
http://www.newscientist.com/articleimages/mg12016390.100/0-four-fundamental-forces.html
“Experiments to determine the density of the earth,” by Henry Cavendish, ESQ., F.R.S. AND A.S. Read
  June 21, 1798   (From the Philosophical Transactions of the Royal Society of London for the year
                                   1798, Part II. , pp. 469-526)




           From: http://www.archive.org/details/lawsofgravitatio00mackrich
DATA SETS: some examples with “native metadata”

2-d_soil_temps.csv
surface, and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to calibrate a model of temperature propagation. Surface temperature was measured with an infrared thermometer, subsurface temperatures with a thermocouple.
----------------------------
5-minute_light_data_for_4_continuous_days_plus_reference.xls
PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along the transect are receiving.
----------------------------
CO2_of_air_at_different_heights_July_9.xls
concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the evening.
----------------------------
Fern_light_response.xls
Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light levels and their instantaneous photosynthesis and conductance is measured. used in conjunction with the induction data (below) for physiological characterization of the ferns.
----------------------------
La_Selva_species_photosyntheis_table.xls
incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a shade house in Costa Rica.
----------------------------
manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root grown and CO2 production.
----------------------------
moisture_release_curves.xls
percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory for calibration of water content with water potential. soil is from the James Reserve in California.
----------------------------
Photosynthetic_induction.xls
… m/2/s and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.
----------------------------
run_2_24-h_data_for_mesh.xls
measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into the forest for about 30 meters. Pyronometers facing up and down, pyrgeometer facing up and down, PAR, air temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated. used for examining edge effects in forests.
----------------------------
Segment_of_wallflower_compare_colorspaces_blur.xls
pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces. segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in images collected after this training data was collected (and used to determine the best color space for this task).
manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat
dissipation probe) for multiple branches of Manzanita, collected with a datalogger.  used to
correlate physiological activity with below-ground measures of root grown and CO2 production.



sbid battery datetime heater_voltage Manz1Sap1 Manz1Sap2 Manz1Sap3 Manz1Sap4 Manz2Sap5 Manz2Sap6 Manz2Sap7 Manz3Sap10 Manz3Sap8 Manz3Sap9 Manz4Sap11 timestamp Datagap Julian


2         12.365    1196796112          2018.8     0.5585    0.51029   0.55517   0.54354   0.6067     0.52858   0.55351   0.59008   0.59506   0.60337    0.56514   12/4/07 11:21       4.47351
3         12.348    1196796232          2017.9     0.55682   0.51028   0.5535    0.54352   0.60669    0.52857   0.55017   0.59007   0.59505   0.60336    0.56513   12/4/07 11:23   0   4.47490
4         12.357    1196796352          2018.6     0.55514   0.51027   0.55348   0.54351   0.60501    0.52855   0.55016   0.59005   0.59504   0.60501    0.56512   12/4/07 11:25   0   4.47628
5         12.354    1196796472          2017.6     0.55514   0.51026   0.55181   0.5435    0.60334    0.52855   0.54849   0.59004   0.59503   0.60334    0.56511   12/4/07 11:27   0   4.47767
6         12.334    1196796592          2018.3     0.55347   0.51026   0.55015   0.5435    0.60333    0.52854   0.54682   0.59004   0.59502   0.605      0.56511   12/4/07 11:29   0   4.47906
7         12.34     1196796712          2018.5     0.55014   0.50859   0.55014   0.54349   0.60332    0.53019   0.54349   0.59003   0.59501   0.60498    0.56676   12/4/07 11:31   0   4.48045
8         12.337    1196796832          2017.8     0.55013   0.50692   0.55013   0.54348   0.60332    0.53019   0.54182   0.59002   0.59501   0.60498    0.56675   12/4/07 11:33   0   4.48184
9         12.328    1196796952          2017.5     0.5468    0.50691   0.5468    0.54347   0.60331    0.53018   0.53849   0.59001   0.595     0.60497    0.56674   12/4/07 11:35   0   4.48323
10        12.323    1196797072          2017       0.54679   0.50524   0.54679   0.54347   0.59998    0.53017   0.53682   0.59      0.59499   0.60496    0.56674   12/4/07 11:37   0   4.48462
11        12.328    1196797192          2018.9     0.54679   0.50191   0.54512   0.5418    0.59665    0.53017   0.53349   0.59      0.59498   0.60496    0.56673   12/4/07 11:39   0   4.48601
12        12.319    1196797312          2017.7     0.54345   0.49857   0.54178   0.54178   0.59663    0.53015   0.53015   0.58998   0.5933    0.60327    0.56671   12/4/07 11:41   0   4.48740
13        12.311    1196797432          2017.3     0.54343   0.4969    0.54011   0.54177   0.59661    0.53014   0.52848   0.58997   0.59329   0.6016     0.5667    12/4/07 11:43   0   4.48878
14        12.316    1196797552          2018.6     0.5401    0.49357   0.53678   0.54176   0.59328    0.53013   0.5268    0.58995   0.59328   0.60325    0.56669   12/4/07 11:45   0   4.49017
15        12.31     1196797672          2016.8     0.53844   0.4919    0.53511   0.54176   0.59494    0.53013   0.52514   0.58995   0.59328   0.60325    0.56503   12/4/07 11:47   0   4.49156
16        12.31     1196797792          2017.1     0.53676   0.48856   0.53343   0.54174   0.59326    0.53011   0.5218    0.58993   0.59326   0.60323    0.56501   12/4/07 11:49   0   4.49295
17        12.31     1196797912          2017.1     0.53342   0.48523   0.5301    0.54173   0.59324    0.5301    0.51846   0.58826   0.59324   0.60321    0.56499   12/4/07 11:51   0   4.49434
18        12.301    1196798031          2017.5     0.53174   0.48521   0.52842   0.53839   0.59156    0.53008   0.51845   0.58824   0.59323   0.6032     0.56498   12/4/07 11:53   0   4.49573
19        12.301    1196798151          2016.3     0.53007   0.48188   0.52509   0.53838   0.59155    0.53007   0.51512   0.58823   0.59321   0.60152    0.5633    12/4/07 11:55   0   4.49712

20        12.303    1196798271          2016.6     0.5284    0.47855   0.52175   0.53837   0.59154    0.5284    0.5151    0.58821   0.59154   0.60151    0.56163   12/4/07 11:57   0   4.49851




                                                                                   Datum: “0.59998”
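
A bare value such as 0.59998 says nothing until it is put back into its row and column. The following is a minimal sketch of that step, not the data owner's procedure. It assumes the whitespace-separated layout shown in the table, that the two-token date and time belong to the single "timestamp" column, and that the logger clock ran on Pacific Standard Time (UTC-8). The helper name `columns_holding` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Header and one row copied from the sap-flow table (record with sbid 10).
HEADER = ("sbid battery datetime heater_voltage Manz1Sap1 Manz1Sap2 Manz1Sap3 "
          "Manz1Sap4 Manz2Sap5 Manz2Sap6 Manz2Sap7 Manz3Sap10 Manz3Sap8 "
          "Manz3Sap9 Manz4Sap11 timestamp Datagap Julian").split()
ROW = ("10 12.323 1196797072 2017 0.54679 0.50524 0.54679 0.54347 0.59998 "
       "0.53017 0.53682 0.59 0.59499 0.60496 0.56674 12/4/07 11:37 0 "
       "4.48462").split()

# Assumption: the date and time tokens together form the "timestamp" column.
fields = ROW[:15] + [" ".join(ROW[15:17])] + ROW[17:]
record = dict(zip(HEADER, fields))

def columns_holding(rec, value):
    """Return the names of all columns whose text equals `value`."""
    return [name for name, text in rec.items() if text == value]

# Assumption: fixed UTC-8 offset (Pacific Standard Time) for the logger clock.
PST = timezone(timedelta(hours=-8))
logged_at = datetime.fromtimestamp(int(record["datetime"]), tz=PST)

print(columns_holding(record, "0.59998"))
print(logged_at.isoformat())
```

Read this way, the datum is the entry in column Manz2Sap5 for record 10, and the epoch value in the "datetime" column decodes to the same minute as the human-readable timestamp. What the number measures still has to come from the file description: a temperature difference on a heat dissipation probe.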
“Jim Gray on eScience: A Transformed Scientific Method,” T. Hey, S. Tansley, and K. Tolle (eds.), Microsoft Research.
Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007
http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf
“Reanalyses” [or Meta-Analyses ]
    “Atmospheric reanalyses are a main feature within the RDA and were
       intended to be, and have become, a very valuable data resource
       for a wide variety of climate and weather studies. By combining
       many types of atmospheric observations with advanced data
       assimilation and forecast models a “best possible” 3D estimate of
       the atmospheric state over extended time periods is achieved.

    “Reanalyses are supported by many historical data sources that have
      been curated over time. As an illustration the major sources of
      atmospheric profile data include wind only soundings beginning in
      1920 (Figure 2). These are augmented with soundings of
      temperature, humidity, and wind beginning in 1948. “




C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge
                        sharing ,” from the 4th International Digital Curation Conference December 2008, page 6.
      www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
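
The idea of combining observations with a forecast model can be illustrated in one dimension. This is a textbook inverse-variance update, offered as a sketch only. It is not the assimilation scheme used for the RDA reanalyses, and every number in it is made up for illustration.

```python
def assimilate(forecast, forecast_var, observation, observation_var):
    """Blend a model forecast with an observation, weighting each by its
    error variance (scalar optimal-interpolation / Kalman-style update)."""
    gain = forecast_var / (forecast_var + observation_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var

# Illustrative values only: a temperature forecast of 280.0 K with error
# variance 4.0, and a sounding of 282.0 K with error variance 1.0.
analysis, analysis_var = assimilate(280.0, 4.0, 282.0, 1.0)
print(analysis, analysis_var)
```

The analysis lands closer to whichever input has the smaller error variance, and its own variance is smaller than either input's. That is one sense in which a reanalysis can be called a “best possible” estimate.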
Fundamental Questions:
• Data Specification – scientific logic of data
  definition
• Data Creation – specification of methodology
• Data Integrity – preservation -- “chain of
  custody”: “Chain of custody refers to the chronological
   documentation or paper trail, showing the seizure, custody,
   control, transfer, analysis, and disposition of evidence,
   physical or electronic.”
   [http://en.wikipedia.org/wiki/Chain_of_custody, clipped 11/12/09 10:30pm PST]

• Data transformations
     – Logic
     – Competence /Technical Performance / Execution
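
For digital data, one way to make the “paper trail” of a chain of custody checkable is a log in which every entry records a checksum of the data and of the entry before it. The sketch below illustrates that idea with Python's standard `hashlib`. It is not a procedure from any of the sources cited here. The function names, actor labels, and sample bytes are hypothetical, and timestamps are left out to keep the example deterministic.

```python
import hashlib
import json

FIELDS = ("actor", "action", "data_sha256", "prev_hash")

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def entry_digest(body: dict) -> str:
    # Serialize with sorted keys so the same body always hashes the same way.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))

def append_event(log: list, actor: str, action: str, payload: bytes) -> None:
    """Add a custody event that commits to the data and to the prior entry."""
    prev = log[-1]["entry_hash"] if log else "0" * 64
    body = {"actor": actor, "action": action,
            "data_sha256": sha256_hex(payload), "prev_hash": prev}
    log.append({**body, "entry_hash": entry_digest(body)})

def verify(log: list) -> bool:
    """Check that no entry was altered, removed, or reordered."""
    prev = "0" * 64
    for entry in log:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev_hash"] != prev or entry_digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True

raw = b"sbid,battery,datetime\n2,12.365,1196796112\n"  # stand-in for a data file
log = []
append_event(log, "field datalogger", "created", raw)
append_event(log, "lab archive", "transferred", raw)

tampered = [dict(entry) for entry in log]
tampered[0]["actor"] = "someone else"

print(verify(log), verify(tampered))
```

Matching `data_sha256` values across entries show the bytes did not change between custodians. A failed `verify` shows only that the record of custody was altered, not who altered it.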
“Keeping Raw Data in Context”
“…any initiative to share raw clinical research data must also pay close attention to sharing clear
   and complete information about the design of the original studies. Relying on journal articles
   for study design information is problematic, for three reasons. First, journal articles often
   provide insufficient detail when describing key study design features such as randomization
   (1) and intervention details (2). Second, some data sets may come from studies with no
   publications [only 21% of oncology trials registered in ClinicalTrials.gov before 2004 and
   completed by September 2007 were published (3)]. Finally, investigators cannot reliably
   search journal articles for methodological concepts like “double blinding” or “interrupted
   time series,” crucial concepts for proper interpretation of the data. A mishmash of non-
   standardized databases of raw results and unevenly reported study designs is not a strong
   foundation for clinical research data sharing. “


“ We believe that the effective sharing of clinical research data requires the establishment of an
   interoperable federated database system that includes both study design and results data. A
   key component of this system is a logical model of clinical study characteristics in which all
   the data elements are standardized to controlled vocabularies and common ontologies to

    facilitate cross-study comparison and synthesis. “


I Sim, et al. “Keeping Raw Data in Context”[letter] Science v 323 6 Feb 2009, p713.
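
The difference between searching journal prose and searching standardized design fields can be sketched with a small record type. This is an illustration under stated assumptions, not the logical model Sim et al. propose. The vocabulary terms, field names, study identifiers, and result values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum

class Blinding(Enum):      # hypothetical controlled vocabulary
    OPEN_LABEL = "open label"
    SINGLE_BLIND = "single blind"
    DOUBLE_BLIND = "double blind"

class Allocation(Enum):    # hypothetical controlled vocabulary
    RANDOMIZED = "randomized"
    NON_RANDOMIZED = "non-randomized"

@dataclass
class StudyRecord:
    """Raw results stored together with the design that produced them."""
    study_id: str
    blinding: Blinding
    allocation: Allocation
    results: list = field(default_factory=list)

def find_by_design(records, *, blinding=None, allocation=None):
    """Select studies by design feature instead of by free-text search."""
    return [r for r in records
            if (blinding is None or r.blinding is blinding)
            and (allocation is None or r.allocation is allocation)]

records = [  # placeholder studies and values, for illustration only
    StudyRecord("study-A", Blinding.DOUBLE_BLIND, Allocation.RANDOMIZED, [0.42, 0.37]),
    StudyRecord("study-B", Blinding.OPEN_LABEL, Allocation.NON_RANDOMIZED, [0.51]),
]
double_blind = find_by_design(records, blinding=Blinding.DOUBLE_BLIND)
print([r.study_id for r in double_blind])
```

Because “double blind” is a value drawn from a fixed vocabulary, the query cannot miss a study that merely phrased the concept differently.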
“Increasing levels of coordinate digit noise
            associated with repeated projection transformations”




Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information
Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23,
2005.
“It is well known that cartographic coordinates stored in double precision are
far more precisely specified than is merited by their accuracy, even for
highly-accurate global datasets. Far more coordinate digit places are stored
for the sake of avoiding machine error than are needed to define the location
of map objects within the necessary tolerances for both absolute and relative
accuracies.”


“A careful look at the coordinate digits stored as double precision variables
in a GIS yields a variety of interesting patterns that are a result of previous
machine error, rounding error, measurement error, and so forth. Any
slight cartographic alteration (rotation/skewing, clipping/sub-setting,
reprojecting, etc.) can add noise into the coordinate and can be used to
characterize a vector dataset.”




 Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information
 Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23,
 2005.
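
The effect can be reproduced in miniature by projecting a point and un-projecting it many times. The sketch below uses a spherical Mercator projection purely as a convenient example; the transformations studied by Rice, Goodchild and Clarke may differ. The sphere radius and the sample coordinate are assumptions made for illustration.

```python
import math

R = 6378137.0  # assumed sphere radius in metres (the WGS 84 semi-major axis)

def forward(lon_deg, lat_deg):
    """Spherical Mercator: degrees to metres."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def inverse(x, y):
    """Spherical Mercator: metres back to degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat

def round_trips(lon, lat, n):
    """Project and un-project the same point n times."""
    for _ in range(n):
        lon, lat = inverse(*forward(lon, lat))
    return lon, lat

start = (-119.8489, 34.4140)  # illustrative point given to four decimal places
end = round_trips(*start, 1000)
drift = (abs(end[0] - start[0]), abs(end[1] - start[1]))

print(repr(end[0]), repr(end[1]))  # full double-precision digits
print(drift)                       # change introduced by the round trips
```

Four decimal places of a degree correspond to roughly ten metres on the ground, yet `repr` prints about seventeen significant digits. Any digits that change after the round trips sit far below the accuracy of the input, and that is the kind of trailing-digit noise the authors describe.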
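[Figure: data infrastructure plotted against the scale of research. Horizontal axis: “Small Science” to “BIG Science”. Vertical axis, bottom to top: Local / Personal Archiving, Individual Libraries, Data Centers, GRIDS. Plotted from lower left to upper right: Individuals, Cooperative Projects, National Disciplinary Initiatives, International Collaborative Research Effort.]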
“Small Science”?
The “small science,” independent investigator approach traditionally has
characterized a large area of experimental laboratory sciences, such as
chemistry or biomedical research, and field work and studies, such as
biodiversity, ecology, microbiology, soil science, and anthropology. The data
or samples are collected and analyzed independently, and the resulting data
sets from such studies generally are heterogeneous and unstandardized, with
few of the individual data holdings deposited in public data repositories or
openly shared.
        The data exist in various twilight states of accessibility, depending on
the extent to which they are published, discussed in papers but not revealed, or
just known about because of reputation or ongoing work, but kept under
absolute or relative secrecy. The data are thus disaggregated components of
an incipient network that is only as effective as the individual transactions
that put it together. Openness and sharing are not ignored, but they are not
necessarily dominant either. These values must compete with strategic
considerations of self-interest, secrecy, and the logic of mutually beneficial
exchange, particularly in areas of research in which commercial applications
are more readily identifiable.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie
M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the
Public Domain Office of International Scientific and Technical Information Programs Board on International Scientific
Organizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8
Maria Sibylla Merian Metamorphosis
insectorum Surinamensium
(Metamorphosis of the Insects of
Surinam) Amsterdam, 1705, figure 46
Hand-colored engraving (123)




                   http://www.loc.gov/exhibits/dres/dre123.jpg
DARWIN




http://darwin-online.org.uk/converted/published/1975_NaturalSelection_F1583/1975_NaturalSelection_F1583_fig03.jpg
http://www.nyu.edu/projects/materialworld/images/1_Darwin%20Tree%20B%2036.jpg
FIELD NOTES
FROM THE AMERICAN MUSEUM CONGO EXPEDITION 1909-1915

            http://diglib1.amnh.org/cgi-bin/database/index.cgi
http://diglib1.amnh.org/galleries/bats/taphozous_mauritianus.html
Rheinardia ocellata, the Crested Argus. Photographed at night by an
automatic camera-trap in the Ngoc Linh foothills (Quang Nam Province).
             Courtesy AMNH Center for Biodiversity and Conservation
By Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009. SEE:

      http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse




    http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14
How many data sources contributed to this analysis?
Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper,
          OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430

         http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf
What does “Full Life-Cycle” Data Management Mean?
US NSF “DataNet” Program
            “the full data preservation and access lifecycle”

      •   “acquisition”
      •   “documentation”
      •   “protection”
      •   “access”
      •   “analysis and dissemination”
      •   “migration”
      •   “disposition”
“Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation”, NSF 07-601,
US National Science Foundation, Office of Cyberinfrastructure, Directorate for Computer & Information
Science & Engineering
Incentives?
How do we Incentivize Change?
•   Individuals
•   Professions / Disciplines
•   Organizations
•   Institutions (Universities, Research Institutes,
    Museums, Gardens, Herbaria, Aquariums, Zoos)
•   “Memory Institutions” (Libraries, Archives)
•   Governments
•   Funders / Sponsors
•   Publishers!
Individuals’ willingness to share:
            the core functions of scholarly communication

• “Registration, which allows claims of precedence for a
  scholarly finding.
• “Certification, which establishes the validity of a registered
  scholarly claim.
• “Awareness, which allows participants in the scholarly system
  to remain aware of new claims and findings.
• “Archiving, which preserves the scholarly record over time.
• “Rewarding, which rewards participants for their
  performance in the communication system based on metrics
  derived from that system.”



   Roosendaal, H., Geurts, P., in Cooperative Research Information Systems in Physics (Oldenburg, Germany, 1997).
Professional / Disciplinary Incentives?
• Norms and standards for sharing vary by discipline
• In “big science” (astrophysics / astronomy /
  meteorology / oceanography / genomics) sharing is
  expected (if not required) and contributions to a
  common fund of knowledge are assumed (see also:
  GenBank)
   – Standards are relatively clear
   – Mechanisms for sharing are well-developed
   – Collective / collaborative authorship is commonplace
• In “small science” such norms are weaker
Small Science: Data Deposit and Access

• Data are typically held in many formats
• Discovery of data is very weakly supported by
  standards-development
• Access to and use of data are highly variable
• However, progress has been made on museum
  specimen data in the past 20 years (see, for
  example, GBIF and many allied projects)
• Some progress has been made respecting
  observational and other data
• Ecological and conservation field data remain highly
  problematic
Some suggestions for action include:

•   government agencies and private foundations must both set strict
    requirements for effective sharing – with serious penalties (such as
    disqualification from future research funding) for failures to share;
•   peer review processes must include rigorous scrutiny of past histories of
    sharing and must require state-of-the-art planning for sharing (not simply
    a promise to “put data up on the Web”);
•   negotiations for “overhead” (“indirect costs”) compensation from funders
    must include examination of digital infrastructure adequate for sharing
    and maintenance of data;
•   accreditation bodies for educational institutions and museums must start
    to require demonstrated evidence of capacity to support digital access
    and maintenance of data;
•   professional societies and professional disciplines must begin to require
    evidence of effective sharing of data in evaluating credentials for hiring,
    promotion and tenure.
http://www.mikero.com/blog/2009/02/20/more-darwin
         http://www.zazzle.com/darwin2009
From: Tom Moritz [mailto:tom.moritz@gmail.com]
Sent: Thursday, November 12, 2009 1:46 AM
To: Donat Agosti
Subject: Snapple Real Fact #134: " An ant can lift 50 times its own weight. ”

Is this true?
Tom
________________________________________________
From: Donat Agosti <agosti@amnh.org>
Date: Wed, Nov 11, 2009 at 8:03 PM
Subject: RE: Snapple Real Fact #134: " An ant can lift 50 times its own weight. "
To: Tom Moritz tom.moritz@gmail.com

People says so [emphasis added] – but we once looked for the evidence, but
   could not find a scientific paper confirming this.
D
Iobi Ludolfi aliàs Leut-holf dicti Historia Æthiopica, sive Brevis & succincta descriptio regni
Habessinorum, quod vulgò malè Presbyteri Iohannis vocatur : 2009 Cambridge University Library




“They [the hippopotami] present the following appearance; four-footed,
  with cloven hooves like cattle; blunt-nosed; with a
  horse's mane, visible tusks, a horse's tail and voice; big as the
  biggest bull. Their hide is so thick that, when it is dried,
  spearshafts are made of it.”
Herodotus, The Histories (with an English translation by A. D. Godley). Cambridge. Harvard University Press. 1920. LXXI
http://old.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Aabo%3Atlg%2C0016%2C001&query=2%3A71%3A1
     [clipped 11/12/09]
a problem with “evidence”…




“…the great trouble with the world was that
  which survived was held in hard evidence as
  to past events. A false authority clung to what
  persisted, as if those artifacts of the past
  which had endured had done so by some act
  of their own will.”
                          -- Cormac McCarthy The Crossing
“Πάντα ῥεῖ καὶ οὐδὲν μένει”
Heraclitus: “Everything flows, nothing stands still.”



           All data is dynamic
From examination of elephants’ skulls the early Greeks deduced
that a species of humanoid Cyclops existed…

(SEE -- for example -- The Odyssey and Ulysses’s encounter
with Polyphemus on the island of Sicily…)




http://www.amnh.org/exhibitions/mythiccreatures/land/greek.php
Another deduction from the evidence of narwhal tusks…
“In the Middle Ages, narwhal tusks were widely thought to be unicorn horns
with magical, curative properties. Indeed, cups made from narwhal tusks
(above) were thought to neutralize poisons and were highly valued.”

http://www.amnh.org/exhibitions/mythiccreatures/land/unicorns.php
Kirtland’s Warbler / Abaco Island, The Bahamas
“NATIVE” METADATA
DEAD HARBOR SEAL and 5 CALIFORNIA CONDORS !!!

University of California, Berkeley: iSchool Nov, 2009

  • 1. Hinges and Loops? -- Data as Evidence I-School UC, Berkeley November 13, 2009 “Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment
  • 2. “Othello: ‘Villain: be sure thou prove my love a whore; Be sure of it; give me the ocular proof; Or by the worth of man’s eternal soul, Thou hadst been better born a dog Than answer my naked wrath! Iago: ‘Is’t come to this?’ Othello: ‘Make me to see‘t ; or at the least so prove it, That the probation bear no hinge nor loop To hang doubt on; or woe upon thy life!’ “ The Tragedy of Othello: The Moor of Venice (Act 3 Scene 3)
  • 3. “So the universe has always appeared to the natural mind as a kind of enigma, of which the key must be sought in the shape of some illuminating or power- bringing word or name. That word names the universe's principle, and to possess it is, after a fashion, to possess the universe itself 'God,' 'Matter,' 'Reason,’ 'the Absolute,’ ‘Energy,’ are so many solving names. You can rest when you have them. You are at the end of your metaphysical quest.” William James. "What Pragmatism Means". Lecture 2 in Pragmatism: A new name for some old ways of thinking. New York: Longman Green and Co (1922): 52-52. http://www.archive.org/stream/pragmatismnewnam00jame
  • 5. Clear definitions are good (!) We should not reflexively rely on metaphysical “solving” / “power-bringing” words… ADD to James’s list?: “Knowledge” “Information” “Data” ???
  • 6.
  • 8. Usage Data: The word data is the Latin plural of datum, neuter past participle of dare, "to give", hence "something given". “ Data leads a life of its own quite independent of datum, of which it was originally the plural. It occurs in two constructions: as a plural noun (like earnings), taking a plural verb and plural modifiers (as these, many, a few) but not cardinal numbers, and serving as a referent for plural pronouns; and as an abstract mass noun (like information), taking a singular verb and singular modifiers (as this, much, little), and being referred to by a singular pronoun. Both constructions are standard. The plural construction is more common in print, perhaps because the house style of some publishers mandates it.” The Merriam-Webster Online Dictionary http://www.merriam-webster.com/dictionary/data
  • 9. “Data” ? [technological] “…’data’ are defined as any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” -- Program Solicitation 07-601 “Sustainable Digital Data Preservation and Access Network Partners (DataNet)” Taken in this broadest possible sense, “data” are thus simply electronic coded forms of information. And virtually anything can be represented as “data” so long as it is electronically machine-readable.
  • 10. “The digital universe in 2007 — at 2.25 x 10^21 bits (281 exabytes or 281 billion gigabytes) — was 10% bigger than we thought. The resizing comes as a result of faster growth in cameras, digital TV shipments, and better understanding of information replication. “By 2011, the digital universe will be 10 times the size it was in 2006. “As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home. “Fast-growing corners of the digital universe include those related to digital TV, surveillance cameras, Internet access in emerging countries, sensor-based applications, datacenters supporting “cloud computing,” and social networks.” The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
  • 11. “The diversity of the digital universe can be seen in the variability of file sizes, from 6 gigabyte movies on DVD to 128-bit signals from RFID tags. Because of the growth of VoIP, sensors, and RFID, the number of electronic information “containers” — files, images, packets, tag contents — is growing 50% faster than the number of gigabytes. The information created in 2011 will be contained in more than 20 quadrillion — 20 million billion — of such containers, a tremendous management challenge for both businesses and consumers. alone. “ The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
  • 12. “Data” [epistemic] “Measurements, observations or descriptions of a referent -- such as an individual, an event, a specimen in a collection or an excavated/surveyed object -- created or collected through human interpretation (whether directly “by hand” or through the use of technologies)” -- AnthroDPA Working Group on Metadata (May, 2009)
  • 13. “The General Definition of Information (GDI)” σ is an instance of information, understood as semantic content, if and only if: • (GDI.1) σ consists of one or more data; • (GDI.2) the data in σ are well-formed; • (GDI.3) the well-formed data in σ are meaningful. Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information” (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
  • 14. “…with the corollary assumptions that they are objective -- that is, not conditioned by subjective perspectives and invariant – that is, true under all circumstances.” -- Draft GBIF DPFTG Report, 2009 SEE: R. Nozick, Invariances: The Structure of the Objective World, Harvard University Press, Cambridge, 2001. AND L. Daston and P. Galison, Objectivity, Zone Books, NY, 2007.
  • 15. The Diaphoric Definition of Data (DDD): “According to GDI, information cannot be dataless but, in the simplest case, it can consist of a single datum. Now a datum is reducible to just a lack of uniformity (diaphora is the Greek word for “difference”), so a general definition of a datum is: The Diaphoric Definition of Data (DDD): A datum is a putative fact regarding some difference or lack of uniformity within some context. “Depending on philosophical inclinations, DDD can be applied at three levels: 1. data as diaphora de re, that is, as lacks of uniformity in the real world out there. There is no specific name for such “data in the wild”. A possible suggestion is to refer to them as dedomena (“data” in Greek; note that our word “data” comes from the Latin translation of a work by Euclid entitled Dedomena). Dedomena are not to be confused with environmental data (see section 1.7.1). They are pure data or proto-epistemic data, that is, data before they are epistemically interpreted. As “fractures in the fabric of being” they can only be posited as an external anchor of our information, for dedomena are never accessed or elaborated independently of a level of abstraction (more on this in section 3.2.2). They can be reconstructed as ontological requirements, like Kant's noumena or Locke's substance: they are not epistemically experienced but their presence is empirically inferred from (and required by) experience. Of course, no example can be provided, but dedomena are whatever lack of uniformity in the world is the source of (what looks to information systems like us as) as data, e.g., a red light against a dark background. 
Note that the point here is not to argue for the existence of such pure data in the wild, but to provide a distinction that (in section 1.6) will help to clarify why some philosophers have been able to accept the thesis that there can be no information without data representation while rejecting the thesis that information requires physical implementation; …”
  • 16. The Diaphoric Definition of Data (DDD): (cont.) “2. data as diaphora de signo, that is, lacks of uniformity between (the perception of) at least two physical states, such as a higher or lower charge in a battery, a variable electrical signal in a telephone conversation, or the dot and the line in the Morse alphabet; and 3. data as diaphora de dicto, that is, lacks of uniformity between two symbols, for example the letters A and B in the Latin alphabet.” Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information” (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
  • 17. “Evidence”? “Data having probative value and authority”? i.e. well supported by scientific logic and considered technically valid
  • 18. Policy Formation and Decision Making
  • 19. Political Power and Knowledge [diagram labels: High; ???; Politicians; Responsibility and Power; Administrators or Managers; Analysts/Technicians; Scientists; High; Low; Knowledge (in Western scientific terms)] (Sutton, 1999) From: Organizaciones que aprenden, paises que aprenden: lecciones y AP en Costa Rica by Andrea Ballestero, Director, ELAP
  • 20. Wednesday, January 21st, 2009 at 12:00 am MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES SUBJECT: Freedom of Information Act A democracy requires accountability, and accountability requires transparency. As Justice Louis Brandeis wrote, "sunlight is said to be the best of disinfectants." In our democracy, the Freedom of Information Act (FOIA), which encourages accountability through transparency, is the most prominent expression of a profound national commitment to ensuring an open Government. At the heart of that commitment is the idea that accountability is in the interest of the Government and the citizenry alike. The Freedom of Information Act should be administered with a clear presumption: In the face of doubt, openness prevails. The Government should not keep information confidential merely because public officials might be embarrassed by disclosure, because errors and failures might be revealed, or because of speculative or abstract fears. Nondisclosure should never be based on an effort to protect the personal interests of Government officials at the expense of those they are supposed to serve. In responding to requests under the FOIA, executive branch agencies (agencies) should act promptly and in a spirit of cooperation, recognizing that such agencies are servants of the public. All agencies should adopt a presumption in favor of disclosure, in order to renew their commitment to the principles embodied in FOIA, and to usher in a new era of open Government. The presumption of disclosure should be applied to all decisions involving FOIA…[clip] Barack Obama http://www.whitehouse.gov/the_press_office/Freedom_of_Information_Act/
  • 21. “Declaration of Scientific Principles” in “The Commonwealth of Science” “7. The pursuit of scientific inquiry demands complete intellectual freedom and unrestricted international exchange of knowledge…” from “The Commonwealth of Science” Nature No. 3753 October 4, 1941.
  • 22. August 4, 2009: the White House issued a memorandum stating unequivocally “Sound science should inform policy decisions” “Science and Technology Priorities for the FY2011 Budget,” PR Orszag and JP Holdren August 4, 2009, Memorandum for the Heads of Executive Departments and Agencies, M-09-27. http://www.whitehouse.gov/omb/assets/memoranda_fy2009/m09-27.pdf
  • 23. The $3.6 billion Large Hadron Collider (LHC) will sample and record the results of up to 600 million proton collisions per second, producing roughly 15 petabytes (15 million gigabytes) of data annually in search of new fundamental particles. To allow thousands of scientists from around the globe to collaborate on the analysis of these data over the next 15 years (the estimated lifetime of the LHC), tens of thousands of computers located around the world are being harnessed in a distributed computing network called the Grid. Within the Grid, described as the most powerful supercomputer system in the world, the avalanche of data will be analyzed, shared, repurposed and combined in innovative new ways designed to reveal the secrets of the fundamental properties of matter. Source: http://public.web.cern.ch/public/en/LHC/L
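To put the annual figure on this slide on a more familiar scale, the following back-of-the-envelope sketch converts 15 petabytes per year into gigabytes and into an average sustained rate. It assumes decimal units (1 petabyte = 10^15 bytes) and a 365-day year with recording spread evenly over it; both are simplifying assumptions, not statements from CERN.

```python
# Rough arithmetic on the LHC figure quoted above.
# Assumptions (not from the source): decimal units, 365-day year,
# data volume spread evenly across the whole year.
BYTES_PER_PETABYTE = 10**15
BYTES_PER_GIGABYTE = 10**9
SECONDS_PER_YEAR = 365 * 24 * 3600

annual_bytes = 15 * BYTES_PER_PETABYTE

# 15 petabytes expressed in gigabytes (the slide says "15 million gigabytes").
annual_gigabytes = annual_bytes // BYTES_PER_GIGABYTE

# Average rate in bytes per second if recording were continuous all year.
average_bytes_per_second = annual_bytes / SECONDS_PER_YEAR

print(f"{annual_gigabytes:,} GB per year")
print(f"about {average_bytes_per_second / 10**6:.0f} MB per second on average")
```

Under these assumptions the slide's parenthetical "15 million gigabytes" checks out, and the average rate works out to a little under half a gigabyte per second.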
  • 24. “The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/imag
  • 25. “The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/images/aug08/22lg.jpg
  • 26. The (US) NCAR Research Data Archive (RDA) “The NCAR Research Data Archive (RDA) is a comparatively small (currently 246 TB, less than 5% of the MSS [Mass Storage System] total size), but very important, part of the MSS stored data. The RDA has been curated by the staff in the Computational and Information Systems Laboratory for over 40 years, [emphasis added] and as such contains reference datasets used by large numbers of scientists. The RDA contents are long-term atmospheric (surface and upper air) and oceanographic observations, grid analyses of observational datasets, operational weather prediction model output, reanalyses, satellite derived datasets, and ancillary datasets, such as topography/bathymetry, vegetation, and land use. The RDA is not a static collection; it is now over 580 datasets with about 100 routinely updated and 10-20 new ones added each year. “ C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008, page 5. www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
  • 27. NCAR Research Data Archive (RDA) C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008 , page 7. www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
  • 29. Facebook? Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text.) Training to Climb an Everest of Digital Data By ASHLEE VANCE NYT Published: October 11, 2009 http://www.nytimes.com/2009/10/12/technology/12data.html?_r=1
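The figures quoted on this slide can be cross-checked with simple arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and treats "more than 1 petabyte" as exactly 1 petabyte, so the per-photo number is a lower bound rather than a measurement.

```python
# Sanity check of the figures quoted from the NYT article.
# Assumptions (not from the source): decimal petabyte, exactly 1 PB of storage.
BYTES_PER_PETABYTE = 10**15

photos = 40 * 10**9          # "40 billion photos"
pages_of_text = 500 * 10**9  # "about 500 billion pages of text"

# Implied average storage per photo (a lower bound, since storage is "more than" 1 PB).
bytes_per_photo = BYTES_PER_PETABYTE // photos

# Implied size of one page of text.
bytes_per_page = BYTES_PER_PETABYTE // pages_of_text

print(bytes_per_photo, "bytes per photo")  # 25000
print(bytes_per_page, "bytes per page")    # 2000
```

About 25 kilobytes per photo and 2,000 bytes per page are the implied averages; the page figure is plausible for plain text, while the photo figure suggests the quoted storage total is a loose lower bound.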
  • 30. “Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment
  • 32. “Experiments to determine the density of the earth,” by Henry Cavendish, ESQ., F.R.S. AND A.S. Read June 21, 1798 (From the Philosophical Transactions of the Royal Society of London for the year 1798, Part II. , pp. 469-526) From: http://www.archive.org/details/lawsofgravitatio00mackrich
  • 33. DATA SETS: some examples with “native metadata”
    ----------------------------
    2-d_soil_temps.csv
    surface and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to calibrate a model of temperature propagation. Surface temperature was measured with an infrared thermometer, subsurface temperatures with a thermocouple.
    ----------------------------
    5-minute_light_data_for_4_continuous_days_plus_reference.xls
    PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along the transect are receiving.
    ----------------------------
    CO2_of_air_at_different_heights_July_9.xls
    concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the evening.
    ----------------------------
    Fern_light_response.xls
    Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light levels and their instantaneous photosynthesis and conductance is measured. used in conjunction with the induction data (below) for physiological characterization of the ferns.
    ----------------------------
    La_Selva_species_photosyntheis_table.xls
    incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a shade house in Costa Rica.
    ----------------------------
    manzanita_sapflow_12-5-07_to_7-7-08.xls
    instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root growth and CO2 production.
    ----------------------------
    moisture_release_curves.xls
    percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory for calibration of water content with water potential. soil is from the James Reserve in California.
    ----------------------------
    Photosynthetic_induction.xls
    [text illegible in the source] m/2/s and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.
    ----------------------------
    run_2_24-h_data_for_mesh.xls
    measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into the forest for about 30 meters. Pyranometers facing up and down, pyrgeometer facing up and down, PAR, air temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated. used for examining edge effects in forests.
    ----------------------------
    Segment_of_wallflower_compare_colorspaces_blur.xls
    pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces. segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in images collected after this training data was collected (and used to determine the best color space for this task).
  • 34. manzanita_sapflow_12-5-07_to_7-7-08.xls instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root growth and CO2 production.
    sbid,battery,datetime,heater_voltage,Manz1Sap1,Manz1Sap2,Manz1Sap3,Manz1Sap4,Manz2Sap5,Manz2Sap6,Manz2Sap7,Manz3Sap10,Manz3Sap8,Manz3Sap9,Manz4Sap11,timestamp,Datagap,Julian
    2,12.365,1196796112,2018.8,0.5585,0.51029,0.55517,0.54354,0.6067,0.52858,0.55351,0.59008,0.59506,0.60337,0.56514,12/4/07 11:21,,4.47351
    3,12.348,1196796232,2017.9,0.55682,0.51028,0.5535,0.54352,0.60669,0.52857,0.55017,0.59007,0.59505,0.60336,0.56513,12/4/07 11:23,0,4.47490
    4,12.357,1196796352,2018.6,0.55514,0.51027,0.55348,0.54351,0.60501,0.52855,0.55016,0.59005,0.59504,0.60501,0.56512,12/4/07 11:25,0,4.47628
    5,12.354,1196796472,2017.6,0.55514,0.51026,0.55181,0.5435,0.60334,0.52855,0.54849,0.59004,0.59503,0.60334,0.56511,12/4/07 11:27,0,4.47767
    6,12.334,1196796592,2018.3,0.55347,0.51026,0.55015,0.5435,0.60333,0.52854,0.54682,0.59004,0.59502,0.605,0.56511,12/4/07 11:29,0,4.47906
    7,12.34,1196796712,2018.5,0.55014,0.50859,0.55014,0.54349,0.60332,0.53019,0.54349,0.59003,0.59501,0.60498,0.56676,12/4/07 11:31,0,4.48045
    8,12.337,1196796832,2017.8,0.55013,0.50692,0.55013,0.54348,0.60332,0.53019,0.54182,0.59002,0.59501,0.60498,0.56675,12/4/07 11:33,0,4.48184
    9,12.328,1196796952,2017.5,0.5468,0.50691,0.5468,0.54347,0.60331,0.53018,0.53849,0.59001,0.595,0.60497,0.56674,12/4/07 11:35,0,4.48323
    10,12.323,1196797072,2017,0.54679,0.50524,0.54679,0.54347,0.59998,0.53017,0.53682,0.59,0.59499,0.60496,0.56674,12/4/07 11:37,0,4.48462
    11,12.328,1196797192,2018.9,0.54679,0.50191,0.54512,0.5418,0.59665,0.53017,0.53349,0.59,0.59498,0.60496,0.56673,12/4/07 11:39,0,4.48601
    12,12.319,1196797312,2017.7,0.54345,0.49857,0.54178,0.54178,0.59663,0.53015,0.53015,0.58998,0.5933,0.60327,0.56671,12/4/07 11:41,0,4.48740
    13,12.311,1196797432,2017.3,0.54343,0.4969,0.54011,0.54177,0.59661,0.53014,0.52848,0.58997,0.59329,0.6016,0.5667,12/4/07 11:43,0,4.48878
    14,12.316,1196797552,2018.6,0.5401,0.49357,0.53678,0.54176,0.59328,0.53013,0.5268,0.58995,0.59328,0.60325,0.56669,12/4/07 11:45,0,4.49017
    15,12.31,1196797672,2016.8,0.53844,0.4919,0.53511,0.54176,0.59494,0.53013,0.52514,0.58995,0.59328,0.60325,0.56503,12/4/07 11:47,0,4.49156
    16,12.31,1196797792,2017.1,0.53676,0.48856,0.53343,0.54174,0.59326,0.53011,0.5218,0.58993,0.59326,0.60323,0.56501,12/4/07 11:49,0,4.49295
    17,12.31,1196797912,2017.1,0.53342,0.48523,0.5301,0.54173,0.59324,0.5301,0.51846,0.58826,0.59324,0.60321,0.56499,12/4/07 11:51,0,4.49434
    18,12.301,1196798031,2017.5,0.53174,0.48521,0.52842,0.53839,0.59156,0.53008,0.51845,0.58824,0.59323,0.6032,0.56498,12/4/07 11:53,0,4.49573
    19,12.301,1196798151,2016.3,0.53007,0.48188,0.52509,0.53838,0.59155,0.53007,0.51512,0.58823,0.59321,0.60152,0.5633,12/4/07 11:55,0,4.49712
    20,12.303,1196798271,2016.6,0.5284,0.47855,0.52175,0.53837,0.59154,0.5284,0.5151,0.58821,0.59154,0.60151,0.56163,12/4/07 11:57,0,4.49851
    Datum: “0.59998”
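The single datum “0.59998” means little without the record around it. As a minimal sketch, the snippet below takes the record with sbid 10 from the table above (where that value appears under Manz2Sap5) and checks that its two time fields agree. It assumes that the `datetime` column is a Unix epoch in seconds and that the `timestamp` column is local time at a fixed UTC-8 offset; neither assumption is stated in the source, and only a subset of the record's columns is reproduced here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the logger's "timestamp" column is local standard time, UTC-8.
ASSUMED_LOCAL_ZONE = timezone(timedelta(hours=-8))

# The record with sbid 10, copied from the table (subset of columns).
record = {
    "sbid": 10,
    "datetime": 1196797072,        # assumed to be Unix epoch seconds
    "Manz2Sap5": 0.59998,          # the datum singled out on the slide
    "timestamp": "12/4/07 11:37",
}

def epoch_to_timestamp(epoch_seconds: int) -> str:
    """Render epoch seconds in the table's month/day/2-digit-year hour:minute style."""
    local = datetime.fromtimestamp(epoch_seconds, tz=ASSUMED_LOCAL_ZONE)
    return f"{local.month}/{local.day}/{local.year % 100:02d} {local.hour}:{local.minute:02d}"

derived_timestamp = epoch_to_timestamp(record["datetime"])
fields_agree = derived_timestamp == record["timestamp"]

print(derived_timestamp, fields_agree)
```

If the two fields agree under the assumed offset, that is one small piece of internal evidence about how the record was produced; if they did not, the datum's context would need explaining before the value could be used.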
  • 35. “Jim Gray on eScience: A Transformed Scientific Method” T. Hey, S. Tansley, and K.Tolle (eds)| Microsoft Research Based on the transcript of a talk given by Jim Gray to the NRC-CSTB1 in Mountain View, CA, on January 11, 2007 http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf
  • 36. “Reanalyses” [or Meta-Analyses ] “Atmospheric reanalyses are a main feature within the RDA and were intended to be, and have become, a very valuable data resource for a wide variety of climate and weather studies. By combining many types of atmospheric observations with advanced data assimilation and forecast models a “best possible” 3D estimate of the atmospheric state over extended time periods is achieved. “Reanalyses are supported by many historical data sources that have been curated over time. As an illustration the major sources of atmospheric profile data include wind only soundings beginning in 1920 (Figure 2). These are augmented with soundings of temperature, humidity, and wind beginning in 1948. “ C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008, page 6. www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
  • 37. Fundamental Questions: • Data Specification – scientific logic of data definition • Data Creation – specification of methodology • Data Integrity – preservation -- “chain of custody” “Chain of custody refers to the chronological documentation or paper trail, showing the seizure, custody, control, transfer, analysis, and disposition of evidence, physical or electronic.” [http://en.wikipedia.org/wiki/Chain_of_custody, clipped 11/12/09 10:30pm PST] • Data transformations – Logic – Competence / Technical Performance / Execution
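One way to make a “chain of custody” concrete for digital data is a log in which every entry records a digest of the data and the hash of the previous entry, so that later alteration of any entry is detectable. The following is a minimal sketch under stated assumptions: the field names (`actor`, `action`, `data_sha256`, `prev_hash`, `entry_hash`) and the helper functions are hypothetical, and a real log would also carry timestamps and signatures.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def hash_body(body: dict) -> str:
    # Serialize with sorted keys so the same body always hashes the same way.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))

def append_event(log: list, actor: str, action: str, data: bytes) -> None:
    """Append a custody event that is linked to the previous entry by its hash."""
    body = {
        "actor": actor,
        "action": action,
        "data_sha256": sha256_hex(data),
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS_HASH,
    }
    log.append({**body, "entry_hash": hash_body(body)})

def verify_log(log: list) -> bool:
    """Return True only if every entry is intact and correctly linked."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev_hash or hash_body(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True

custody_log = []
append_event(custody_log, "field technician", "collected", b"raw sensor readings")
append_event(custody_log, "analyst", "transformed", b"calibrated readings")
print(verify_log(custody_log))
```

The design choice here is that each transformation is recorded as a new entry rather than overwriting the old one, which keeps both the data's integrity and the sequence of hands it passed through open to inspection.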
  • 38. “Keeping Raw Data in Context” “…any initiative to share raw clinical research data must also pay close attention to sharing clear and complete information about the design of the original studies. Relying on journal articles for study design information is problematic, for three reasons. First, journal articles often provide insufficient detail when describing key study design features such as randomization (1) and intervention details (2). Second, some data sets may come from studies with no publications [only 21% of oncology trials registered in ClinicalTrials.gov before 2004 and completed by September 2007 were published (3)]. Finally, investigators cannot reliably search journal articles for methodological concepts like “double blinding” or “interrupted time series,” crucial concepts for proper interpretation of the data. A mishmash of non-standardized databases of raw results and unevenly reported study designs is not a strong foundation for clinical research data sharing.” “We believe that the effective sharing of clinical research data requires the establishment of an interoperable federated database system that includes both study design and results data. A key component of this system is a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” I Sim, et al. “Keeping Raw Data in Context” [letter] Science v 323 6 Feb 2009, p713.
  • 39. “Increasing levels of coordinate digit noise associated with repeated projection transformations” Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
  • 40. “It is well known that cartographic coordinates stored in double precision are far more precisely specified than is merited by their accuracy, even for highly-accurate global datasets. Far more coordinate digit places are stored for the sake of avoiding machine error than are needed to define the location of map objects within the necessary tolerances for both absolute and relative accuracies.” “A careful look at the coordinate digits stored as double precision variables in a GIS yields a variety of interesting patterns that are a result of previous machine error, rounding error, measurement error, and so forth. Any slight cartographic alteration (rotation/skewing, clipping/sub-setting, reprojecting, etc.) can add noise into the coordinate and can be used to characterize a vector dataset.” Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
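The effect Rice, Goodchild and Clarke describe can be illustrated with a small experiment: project a coordinate pair forward and back many times in double precision and inspect the low-order digits. This sketch uses a spherical Mercator projection as a stand-in; the choice of projection, the radius constant, the sample coordinates, and the number of repetitions are all assumptions made for illustration, not the authors' method.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used for this illustration

def to_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Forward spherical Mercator: degrees to metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def from_mercator(x: float, y: float) -> tuple:
    """Inverse spherical Mercator: metres to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat

def round_trips(lon_deg: float, lat_deg: float, count: int) -> tuple:
    """Apply forward-then-inverse projection `count` times."""
    for _ in range(count):
        lon_deg, lat_deg = from_mercator(*to_mercator(lon_deg, lat_deg))
    return lon_deg, lat_deg

start = (-122.2585, 37.8719)  # illustrative coordinates only
end = round_trips(*start, 1000)
drift = (abs(end[0] - start[0]), abs(end[1] - start[1]))

# Print full double-precision representations to expose the trailing digits.
print(repr(start[0]), repr(start[1]))
print(repr(end[0]), repr(end[1]))
print(drift)
```

Any change shows up only far to the right of the decimal point, well below the accuracy of real survey data, which is the quotation's point: the stored digits exceed the meaningful ones, and the surplus digits can carry a trace of the processing history.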
  • 41. GRIDS Data International Centers Collaborative Research Effort Individual National Disciplinary Initiatives Libraries Cooperative Projects Local / Individuals Personal Archiving “Small Science” “BIG Science”
  • 43. The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable. The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of International Scientific and Technical Information Programs Board on International Scientific Organizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8
  • 44. Maria Sibylla Merian Metamorphosis insectorum Surinamensium (Metamorphosis of the Insects of Surinam) Amsterdam, 1705, figure 46 Hand-colored engraving (123) http://www.loc.gov/exhibits/dres/dre123.jpg
  • 45. DARWIN http://darwin-online.org.uk/converted/published/1975_NaturalSelection_F1583/1975_NaturalSelection_F1583_fig03.jpg http://www.nyu.edu/projects/materialworld/images/1_Darwin%20Tree%20B%2036.jpg
  • 46. FIELD NOTES FROM THE AMERICAN MUSEUM CONGO EXPEDITION 1909-1915 http://diglib1.amnh.org/cgi-bin/database/index.cgi
  • 48.
  • 49. Rheinardia ocellata, the Crested Argus. Photographed at night by an automatic camera-trap in the Ngoc Linh foothills (Quang Nam Province). Courtesy AMNH Center for Biodiversity and Conservation
  • 50.
  • 51. By Serge Bloch in NYT: Natalie Angier “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE: http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14
  • 52. How many data sources contributed to this analysis?
  • 53. The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable. The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of International Scientific and Technical Information Programs Board on International Scientific Organizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8
  • 54. GRIDS Data International Centers Collaborative Research Effort Individual National Disciplinary Initiatives Libraries Cooperative Projects Local / Individuals Personal Archiving “Small Science” “BIG Science”
  • 55.
  • 56. Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430 http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf
  • 57. Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430 http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf
  • 58. What does “Full Life-Cycle” Data Management Mean ?
  • 59. US NSF “DataNet” Program “the full data preservation and access lifecycle” • “acquisition” • “documentation” • “protection” • “access” • “analysis and dissemination” • “migration” • “disposition” “Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation” NSF 07-601 US National Science Foundation Office of Cyberinfrastructure Directorate for Computer & Information Science & Engineering
  • 61. How do we Incentivize Change ? • Individuals • Professions / Disciplines • Organizations • Institutions (Universities, Research Institutes, Museums, Gardens, Herbaria, Aquariums, Zoos) • “Memory Institutions” (Libraries, Archives) • Governments • Funders / Sponsors • Publishers!
  • 62. Individual’s willingness to share: the Core functions of Scholarly Communication • “Registration, which allows claims of precedence for a scholarly finding. • “Certification, which establishes the validity of a registered scholarly claim. • “Awareness, which allows participants in the scholarly system to remain aware of new claims and findings. • “Archiving, which preserves the scholarly record over time. • “Rewarding, which rewards participants for their performance in the communication system based on metrics derived from that system. Roosendaal, H., Geurts, P in Cooperative Research Information Systems in Physics (Oldenburg, Germany, 1997).
  • 64. • Norms and standards for sharing vary by discipline • In “big science” (astrophysics / astronomy / meteorology / oceanography / genomics) sharing is expected (if not required) and contributions to a common fund of knowledge are assumed (See also: GENBANK ) – Standards are relatively clear – Mechanisms for sharing are well-developed – Collective / collaborative authorship is commonplace • In “small science” such norms are weaker
  • 65. Small Science: Data Deposit and Access • Data are typically held in many formats • Discovery of data is very weakly supported by standards-development • Access to and use of data are highly variable • [ However progress has been made respecting museum specimen data in the past 20 years [SEE for ex. : GBIF and many allied projects] ] • Some progress has been made respecting observational and other data • Ecological and conservation field data remain highly problematic
  • 66. Some suggestions for action include: • government agencies and private foundations must both set strict requirements for effective sharing – with serious penalties (such as disqualification for future research funding) for failures to share; • peer review processes must include rigorous scrutiny of past histories of sharing and must require state-of-the-art planning for sharing (not simply a promise to “put data up on the Web”); • negotiations for “overhead” (“indirect costs”) compensation from funders must include examination of digital infrastructure adequate for sharing and maintenance of data; • accreditation bodies for educational institutions and museums must start to require demonstrated evidence of capacity to support digital access and maintenance of data; • professional societies and professional disciplines must begin to require evidence of effective sharing of data in evaluating credentials for hiring, promotion and tenure;
  • 67.
  • 68.
  • 69.
  • 70. http://www.mikero.com/blog/2009/02/20/more-darwin http://www.zazzle.com/darwin2009
  • 71. From: Tom Moritz [mailto:tom.moritz@gmail.com] Sent: Thursday, November 12, 2009 1:46 AM To: Donat Agosti Subject: Snapple Real Fact #134: " An ant can lift 50 times its own weight. ” Is this true? Tom ________________________________________________ From: Donat Agosti <agosti@amnh.org> Date: Wed, Nov 11, 2009 at 8:03 PM Subject: RE: Snapple Real Fact #134: " An ant can lift 50 times its own weight. " To: Tom Moritz tom.moritz@gmail.com People says so [emphasis added] – but we once looked for the evidence, but could not find a scientific paper confirming this. D
  • 72. Iobi Ludolfi aliàs Leut-holf dicti Historia Æthiopica, sive Brevis & succincta descriptio regni Habessinorum, quod vulgò malè Presbyteri Iohannis vocatur : 2009 Cambridge University Library "They [the hippopotami] present the following appearance; four-footed, with cloven hooves like cattle; blunt-nosed; with a horse's mane, visible tusks, a horse's tail and voice; big as the biggest bull. Their hide is so thick that, when it is dried, spearshafts are made of it.” Herodotus, The Histories (with an English translation by A. D. Godley). Cambridge. Harvard University Press. 1920. LXXI http://old.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Aabo%3Atlg%2C0016%2C001&query=2%3A71%3A1 [clipped 11/12/09]
  • 73. a problem with “evidence”… “…the great trouble with the world was that which survived was held in hard evidence as to past events. A false authority clung to what persisted, as if those artifacts of the past which had endured had done so by some act of their own will.” -- Cormac McCarthy The Crossing
  • 74. “Πάντα ῥεῖ καὶ οὐδὲν μένει” Heraclitus: “Everything flows, nothing stands still.” All data is dynamic
  • 75. From examination of elephants’ skulls the early Greeks deduced that a species of humanoid Cyclops existed… (SEE -- for example -- The Odyssey and Ulysses’ encounter with Polyphemus on the island of Sicily…) http://www.amnh.org/exhibitions/mythiccreatures/land/greek.php
  • 76. Another deduction from the evidence of narwhal tusks… “In the Middle Ages, narwhal tusks were widely thought to be unicorn horns with magical, curative properties. Indeed, cups made from narwhal tusks (above) were thought to neutralize poisons and were highly valued. “ http://www.amnh.org/exhibitions/mythiccreatures/land/unicorns.php
  • 77. Kirtland’s Warbler / Abaco Island, The Bahamas
  • 78. “NATIVE” METADATA DEAD HARBOR SEAL and 5 CALIFORNIA CONDORS !!!