4. Tasks of data mining
1. Classification
2. Forecasting
3. Visualization
4. Reasoning
5. Analysis
6. Expert systems
5. Big data in materials science
EXAMPLE: over roughly the last 4 years,
my theoretician colleagues and I produced:
over 9000 simulation output files
over 50 articles
6.
7.
1. Accelrys Pipeline Pilot and Materials Studio, http://accelrys.com/products
2. AFLOW framework and Aflowlib.org repository, http://www.aflowlib.org
3. AIDA, Bosch LLC
4. Blue Obelisk Data Repository (XSLT, XML), http://bodr.sourceforge.net
5. CCLib (Python), http://cclib.sf.net
6. CDF (Python), http://kitchingroup.cheme.cmu.edu/cdf
7. CMR (Python), https://wiki.fysik.dtu.dk/cmr
8. Comp. Chem. Comparison and Benchmark Database, http://cccbdb.nist.gov
9. cctbx: Computational Crystallography Toolbox, http://cctbx.sourceforge.net
10. ESTEST (Python, XQuery), http://estest.ucdavis.edu
11. J-ICE online viewer (based on Jmol, Java), http://j-ice.sourceforge.net
12. Materials Project (Python), http://www.materialsproject.org
13. PAULING FILE, the world's largest database for inorganic compounds, http://paulingfile.com
14. Quixote, http://quixote.wikispot.org
15. Scipio (Java), https://scipio.iciq.es
16. WebMO: Web-based interface to computational chemistry packages (Java, Perl), http://webmo.net
New type of modeling software
8. …and smart codes
ENCUT = 500
IBRION = 2
ISIF = 3
NSW = 20
IDIOT = 3
NELMIN = 5
EDIFF = 1.0e-08
EDIFFG = -1.0e-08
IALGO = 38
ISMEAR = 0
LREAL = .FALSE.
LWAVE = .FALSE.
*** VASP MASTER: I AM SURE YOU KNOW WHAT YOU ARE DOING ***
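A code that wants to reason about such an input first has to read it. The sketch below is only an illustration of that step: `parse_incar` and `_convert` are hypothetical helper names, not part of VASP or of any package listed earlier, and the parser is deliberately simplified (one `TAG = value` pair per line, no `;`-separated tags, no multi-value tags).

```python
def _convert(value):
    """Turn a raw INCAR value string into bool, int, float, or leave it as str."""
    lowered = value.lower()
    if lowered in (".true.", "t"):
        return True
    if lowered in (".false.", "f"):
        return False
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_incar(text):
    """Parse 'TAG = value' lines into a dict (hypothetical helper, simplified)."""
    tags = {}
    for raw_line in text.splitlines():
        # Drop trailing comments introduced by '#' or '!'
        line = raw_line.split("#", 1)[0].split("!", 1)[0].strip()
        if "=" not in line:
            continue  # skip blank lines and free text such as banners
        key, value = (part.strip() for part in line.split("=", 1))
        tags[key.upper()] = _convert(value)
    return tags


sample = """ENCUT = 500
IBRION = 2
EDIFF = 1.0e-08
LREAL = .FALSE.
"""
tags = parse_incar(sample)
```

With the tags in a dictionary, a wrapper could compare them against expected ranges before a run is submitted; which checks are sensible is left open here.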
9. d-metal oxides
band gap problem
standard DFT GGA approach
Hartree-Fock admixing
LCAO approximation
Usage of Gaussian basis sets
good atomization energy
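As an illustration of what Hartree-Fock admixing means, a generic one-parameter hybrid functional can be written as below. This is a sketch under assumptions: the slide does not say which hybrid or which mixing fraction $a$ was used (for reference, $a = 1/4$ with PBE exchange and correlation gives PBE0).

```latex
E_{xc}^{\mathrm{hyb}} = a\,E_{x}^{\mathrm{HF}} + (1 - a)\,E_{x}^{\mathrm{GGA}} + E_{c}^{\mathrm{GGA}}
```

Setting $a = 0$ recovers the plain GGA functional, which is the case associated with the band gap problem named above.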
Example of inference over an ontology
12. Open data, open standards, open source in chemistry
1. Elsevier, Wiley, Springer publishers are “evil”
2. “The right to read is the right to mine”
3. “Jailbreaking” the scientific data from PDFs: access, reuse, integrity
4. Why is the level of collaboration so low?
15. Advantages of Python
Syntax: indentation-based blocks, syntactic sugar, near-natural-language readability, flexibility, expressiveness
VERY fast prototyping
Great popularity in the scientific community
100% cross-platform and portable
16. Disadvantages of Python
Relatively slow compared to compiled languages such as C++ or Fortran
Global Interpreter Lock (GIL)
Historically not popular in some narrow scientific areas (where Java “reigns”)
17. Two examples
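# squares of 0..9 via a list comprehension (avoid naming a variable "list")
squares = [x**2 for x in range(10)]
# keep only the values below 5
numbers = [10, 4, 2, -1, 6]
small = list(filter(lambda x: x < 5, numbers))  # list() is needed in Python 3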
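The two idioms are interchangeable for this kind of selection. The sketch below puts them side by side; the variable names are illustrative only. In Python 3 `filter` returns a lazy iterator, so it is wrapped in `list()` to get a concrete result.

```python
numbers = [10, 4, 2, -1, 6]

# Functional style: filter() with a lambda predicate
small_via_filter = list(filter(lambda x: x < 5, numbers))

# Comprehension style: the same selection written inline
small_via_comprehension = [x for x in numbers if x < 5]

# Both keep the original order of the surviving elements
same_result = small_via_filter == small_via_comprehension
```

Many Python programmers prefer the comprehension form for readability, but this is a matter of style, not correctness.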