Ibis: Scaling the Python Data Experience

Delivered at Data Science Summit, July 20, 2015. See http://ibis-project.org for more.

  1. Ibis: Scaling the Python Data Experience
     Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus
     © Cloudera, Inc. All rights reserved.
  2. Wes McKinney
     • A key person in building today's open source Python data community
     • Creator of pandas, a standard Python data wrangling and analytics toolkit used by data scientists
     • Author of the best-selling canonical text Python for Data Analysis (2012)
     • Formerly Founder/CEO of DataPad (acquired by Cloudera in 2014)
  3. Python is popular…
     • Python has become a standard language of data science
     • Why is it popular?
         • Maximizes productivity for data engineers and data scientists
         • Build robust software and do interactive data analysis with 100% Python code
         • Easy to learn, and makes for happy and productive data teams
         • Large, diverse open source development community
         • Comprehensive libraries: data wrangling, ML, visualization, etc.
     • Main use case: data science & engineering Swiss Army knife on small-to-medium-size data
  4. …but Python does not scale today
     • Python ecosystem confined to single-node analysis
         • Great for smaller data sets
         • Requires sampling or aggregations for larger data
         • Distributed tools compromise in various ways
     • Extracting samples or aggregations for larger data means:
         • "Scales" by losing more fidelity
         • Additional ETL overhead to extract samples/aggregations
         • Loss of productivity with multiple languages, tools, etc.
         • Blocks certain analyses and use cases
  5. Ibis: Same Python, now at scale
     • Target user:
         • Data scientists and data engineers ("Python data users")
     • Goals:
         • Mirrors single-node Python experience
         • Scales to any node and data size
         • No compromise in functionality or usability
         • Interactive experience at native hardware speeds
  6. What's announced?
     • First public release of Ibis
         • http://ibis-project.org
     • Beta release to Cloudera Labs
     • Inviting usage and community development
     • Apache-licensed open source
  7. Ibis's Vision
     • Uncompromised Python experience
         • 100% Python end-to-end user workflows
         • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
     • Interactive at big data scale
         • Full-fidelity analysis without extractions
         • Scalability for big data
         • Native hardware speeds for a broad set of use cases
  9. Advantages of our approach
     • Analyze big data 100% in Python, with the same ease as small/medium data on the local filesystem
     • Full-fidelity data access
     • Familiar Python experience and integration with existing Python data libraries
     • Provide a means for Python high-performance computing tools to be leveraged at Hadoop scale
  10. Beta 0.3 release
     • High-level Python API for describing analytics and ETL that can be executed by Impala
         • Familiar API for users of pandas
         • Comprehensive coverage of operations expressible as relational data flows
     • Integrated tools for managing data in HDFS
     • Simple workflows to query data files in several formats (Parquet, Avro, Text)
     • pandas data interchange
  11. Ibis/Impala Joint Roadmap
     • More natural data modeling
         • Complex types support
     • Integration with full Python data ecosystem
         • Advanced analytics + machine learning
         • Enable use of performance computing tools
     • User extensibility with native performance
         • In-memory columnar format
         • Python-to-LLVM IR compilation
     • Workflow and usability tools
  12. Benefits of Ibis
     • Maximize developer productivity
         • Mirrors single-node Python experience
         • Solve big data problems without leaving Python
         • Leverage Python skills, ecosystem, and tools
     • Python as first-class language for Hadoop
         • Full-fidelity analysis without extractions
         • Python analysis at any scale
         • Native hardware speeds for a broad set of use cases
  13. Thank you
     wes@cloudera.com
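
Slide 10 describes a high-level Python API for expressing analytics as relational data flows that a SQL engine such as Impala then executes. The general idea of a deferred expression that compiles to SQL can be sketched roughly as below. This is a toy illustration only and not Ibis's actual API: the `Query` class, its methods, and the `trips` table with its columns are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical deferred query: nothing runs until SQL is generated."""

    table: str
    predicates: Tuple[str, ...] = ()
    group_key: Optional[str] = None
    aggregates: Tuple[str, ...] = ()

    def filter(self, predicate: str) -> "Query":
        # Each step returns a new immutable description of the data flow.
        return replace(self, predicates=self.predicates + (predicate,))

    def group_by(self, key: str) -> "Query":
        return replace(self, group_key=key)

    def aggregate(self, *exprs: str) -> "Query":
        return replace(self, aggregates=self.aggregates + exprs)

    def to_sql(self) -> str:
        # Compile the accumulated description into one SQL statement
        # that a remote engine could execute against the full data set.
        columns = ([self.group_key] if self.group_key else []) + list(self.aggregates)
        sql = f"SELECT {', '.join(columns) or '*'} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.group_key:
            sql += f" GROUP BY {self.group_key}"
        return sql


# Hypothetical table and column names, chosen only for illustration.
query = (
    Query("trips")
    .filter("distance > 0")
    .group_by("city")
    .aggregate("AVG(fare) AS avg_fare")
)
sql = query.to_sql()
print(sql)
```

Because the Python object only describes the computation, the data stays in the cluster and only the (typically small) result would need to come back to the client, which is one way to read the "full-fidelity analysis without extractions" goal on the earlier slides.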
