This document discusses the role of data scientists in analyzing large and complex datasets to help answer critical questions. It notes that over 95% of digital data is unstructured and that an organization of 1,000 knowledge workers loses millions of dollars annually to reformatting and searching for information. Data scientists can help transform this data into usable knowledge by developing expertise in both data management and specific domains. They work with infrastructure experts and domain experts to analyze "big data" and solve grand challenges across many fields.
3. Kilo, Mega, Giga, Tera, Peta, Exa, Zetta = 10^21 bytes. Over 95% of the digital universe is "unstructured data", meaning its content can't be truly represented by its field in a record, such as name, address, or date of last transaction. In organizations, unstructured data accounts for more than 80% of all information. Source: IDC. …An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3 million a year. Source: IDC.
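The prefix ladder and the IDC cost figures on slide 3 can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original deck: the names `SI_PREFIXES`, `bytes_for`, and the constants are hypothetical, and adding the two losses together assumes they do not overlap, which the slide's wording ("an additional") suggests but does not state outright.

```python
# Decimal (SI) prefixes listed on the slide, mapped to their powers of ten.
SI_PREFIXES = {
    "kilo": 3,
    "mega": 6,
    "giga": 9,
    "tera": 12,
    "peta": 15,
    "exa": 18,
    "zetta": 21,
}


def bytes_for(prefix: str) -> int:
    """Return the number of bytes in one unit with the given SI prefix."""
    return 10 ** SI_PREFIXES[prefix]


# IDC figures quoted on the slide (organization with 1,000 knowledge workers).
WORKERS = 1_000
REFORMATTING_LOSS = 5_700_000  # USD per year, time wasted reformatting information
SEARCH_LOSS = 5_300_000        # USD per year, cost of not finding information

# Assumption: the two losses are separate and can simply be added.
total_loss = REFORMATTING_LOSS + SEARCH_LOSS
loss_per_worker = total_loss // WORKERS

print(f"1 zettabyte = {bytes_for('zetta'):,} bytes")
print(f"Combined annual loss: ${total_loss:,}")
print(f"Per knowledge worker: ${loss_per_worker:,}")
```

Under that assumption, the two figures combine to $11 million a year, or $11,000 per knowledge worker.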
4. Why Data Science? Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio. Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems. Chaotic information life cycle, where little time and effort is spent on what should be kept and what can be discarded. Diverse and/or legacy infrastructure: mainframes running COBOL connected with high-speed networks to sensor arrays running Linux.
5. Critical Questions: How will global climate change affect sea levels in major coastal metropolitan areas worldwide? Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59? What gene sequences in cereal grains are associated with greater crop yields in arid environments? How can we reduce false positives in automated airline baggage scans without reducing accuracy? What Internet data can be mined as predictive of firm creation among startups that provide new jobs?
6. “Big Data” Provides Answers: Water sustainability; climate analysis and prediction; energy through fusion; CO2 sequestration; hazard analysis and management; cancer detection and therapy; drug design and development; advanced materials analysis; new combustion systems; virtual product design; in silico semiconductor design. Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
7. “All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.” Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
8. Data Scientists: Transforming Data Into Decisions. [Diagram labels] Knowledge Development for Industry, Education, Government, Research; Domain Experts; Infrastructure Professionals; Information Organization & Visualization; Expertise in specific subject areas; Rapid pace of IT development; Limited opportunity to master technology skills; Limited expertise in domain areas; Data Scientists; Information Analysis; Solution Integration; Proliferation of big data & new technology; Specialized knowledge of HW, FW, MW, SW; Digital Curation; Need for knowledge and information managers; Communication challenges.
9. A Definition of a Data Scientist: A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to: help infrastructure experts with the architecture of hardware and software to manage big data challenges; help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations; help institutions and organizations control and curate data throughout the information lifecycle.
Editor's notes
Facebook friend connections worldwide, a network diagram of the Enron email set, a comparison of similar gene sequences between humans, chimps, and macaques