Exploring GitHub data with Apache Drill on Arm64
Ganesh Raju, Naresh Bhat
Who Are We Anyway?
What is Linaro:
Leading collaboration in the Arm ecosystem
Apache Drill
Open source distributed SQL query engine for non-relational datastores
- JSON document model
- Columnar
Key Advantages
- Columnar
- Schema on the fly
- Integrates with any non-relational datastore
- Elastic scalability
- Data can be treated like SQL tables
- SQL-like query syntax
- No overhead from creating and maintaining schemas, ETL processes, etc.
- Vectorization (SIMD instructions)
Apache Drill on Arm64 Server
Test environment - SW basic configuration
Architecture: Gigabyte Marvell® ThunderX2® "Saber", 3-node cluster
OS platform: Debian GNU/Linux 9.9 (stretch)
Linux kernel version: Debian 4.16.13.linaro.290-1
GCC version: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Glibc version: Debian GLIBC 2.24-11+deb9u4
Java version: openjdk version "1.8.0_191"
  OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_191-b12)
  OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.191-b12, mixed mode)
Hadoop version: Hadoop 2.8.5
  Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8
  Compiled by jdu on 2018-09-10T03:32Z
  Compiled with protoc 2.5.0
Using upstream release packages from apache.org.
Running on a commercially available Arm server based on Marvell ThunderX2.
Test environment - SW basic configuration
ZooKeeper and libzookeeper-java version: 3.4.9-3+deb9u2
Apache Drill version: v1.16.0
Jupyter Notebook version:
  jupyter core 4.5.0
  jupyter-notebook 6.0.1
  qtconsole 4.5.4
  ipython 7.7.0
  ipykernel 5.1.2
  jupyter client 5.3.1
  jupyter lab 1.0.9
  nbconvert 5.6.0
  ipywidgets 7.5.1
  nbformat 4.4.0
  traitlets 4.3.2
Dataset: 3 TB+ GitHub activity dataset containing a full snapshot of more than 2.8 million open source GitHub repositories, including more than 145 million unique commits.
This demo can be replicated using upstream release packages and an open source data set.
SELECT * FROM sys.drillbits;
Select files in dfs;
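The queries on these slides can also be submitted from a Jupyter notebook through Drill's REST API. The following is a minimal sketch, not the presenters' setup: it assumes a Drillbit is reachable at http://localhost:8047 (Drill's default web port), and the helper names build_query_payload and run_query are hypothetical.

```python
import json
import urllib.request

# Assumption: a Drillbit's web/REST interface is listening on the default port.
DRILL_URL = "http://localhost:8047"


def build_query_payload(sql):
    """Return the JSON body that Drill's /query.json endpoint expects."""
    # The statement is sent without a trailing semicolon.
    return {"queryType": "SQL", "query": sql.strip().rstrip(";")}


def run_query(sql, base_url=DRILL_URL, timeout=60):
    """POST a SQL statement to Drill and return the result rows (list of dicts)."""
    body = json.dumps(build_query_payload(sql)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Drill's JSON response carries the result set under the "rows" key.
    return result.get("rows", [])
```

With a running cluster, run_query("SELECT * FROM sys.drillbits") would return one row per Drillbit. The payload builder is kept separate so that it can be checked without a cluster.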
Top projects this year on GitHub
Need to paste Apache Drill query snapshot
Top contributors by year
Need to paste Apache Drill query snapshot
Top contributors to Linux by year
Need to paste Apache Drill query snapshot
Top contributors to Big Data (Hadoop, Spark, HBase, Hive, Drill, etc.)
Need to paste Apache Drill query snapshot
Contributors by Country
SELECT * FROM dfs.`/usersummary/*.json` LIMIT 20
Language Popularity Score
SELECT * FROM dfs.`/usersummary/*.json` LIMIT 20
Top Python repositories by their commits count
SELECT * FROM dfs.`/usersummary/*.json` LIMIT 20
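The three slides above show the same sampling query, which only previews the first 20 rows. A ranking such as "contributors by country" would normally need a grouped count. The sketch below builds such a statement as a string; it is an illustration under stated assumptions, not the query used in the talk. The helper name top_n_query and the column name `country` are hypothetical, since the slides do not show the fields of the usersummary JSON files.

```python
def top_n_query(path_glob, group_column, n=20):
    """Build a Drill SQL string that counts rows per value of group_column."""
    # Backticks delimit identifiers in Drill, so reject them inside the inputs.
    if "`" in path_glob or "`" in group_column:
        raise ValueError("backticks are not allowed in the path or column name")
    limit = int(n)
    if limit <= 0:
        raise ValueError("n must be a positive integer")
    return (
        "SELECT t.`{col}` AS `{col}`, COUNT(*) AS `total` "
        "FROM dfs.`{path}` t "
        "GROUP BY t.`{col}` "
        "ORDER BY `total` DESC "
        "LIMIT {limit}"
    ).format(col=group_column, path=path_glob, limit=limit)


# Hypothetical usage: `country` is an assumed field name, not one from the slides.
sql = top_n_query("/usersummary/*.json", "country", 20)
```

The resulting string can be pasted into the Drill shell or web console once the column name is replaced with a field that actually exists in the data.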
Top Apache Projects by contribution
Need to paste Apache Drill query snapshot
Who Are We Anyway?
We are Linaro: Leading collaboration in the
Arm ecosystem
Linaro: Open Source
Delivering high value collaboration
Top-five company contributor to the Linux kernel
Contributor to >70 open source projects; many maintained by Linaro engineers
Rank  Company  4.8-4.13 Changesets  %
1     Intel    10,833               13.1%
2     Red Hat  5,965                7.2%
3     Linaro   4,636                5.6%
Source: Linux Kernel Development Report, Linux Foundation
Selected projects Linaro contributes to
Linaro: BigData Objective
● Ensure that Arm is a first-class platform for Hadoop and Spark.
● Profile Hadoop and Spark for real-world workloads on 64-bit Arm server systems.
● Ensure that OpenJDK is running optimally against Hadoop and Spark workloads.
❏ Founded in November 1990
❏ Designs the RISC processor cores
❏ Licenses Arm core designs to partners who fabricate and sell to their customers
Arm Ecosystem momentum continues to accelerate
www.arm.com
Workloads
Networking
Virtualization & Containers
Language & Library
Operating system
Marvell
Company founded: 1995
FY19 revenue: $2.9B
Employees: 6,000+
Located in: Santa Clara, CA
R&D centers: US, Israel, India, Germany, China
Patents worldwide: 10,000+
© 2019 Marvell Confidential, All Rights Reserved.
ThunderX2 Second Generation High-End Armv8-A Server SoC
• Up to 32 custom Armv8.1 cores, up to 2.5GHz
• Full Out-of-Order, 1, 2, 4 threads per core
• 1S and 2S Configuration
• Up to 8 DDR4-2667 Memory Controllers, 1 & 2 DPC
• Up to 56 lanes of PCIe Gen3, 14 PCIe controllers
Marvell powers the world’s fastest Arm-based supercomputer
Driven by 145,152 ThunderX2 cores (5,184 CPUs x 28 cores)
Securing the U.S. nuclear arsenal
Marvell-University of Michigan Partnership
Built on Cavium/Marvell-Michigan relationship
Deploy ThunderX for Big Data
● 4800 Cores
● 25 TB Memory
● 40 & 100 Gbps networking
● 3 PB Hadoop File System
Accelerating the software ecosystem for data science for Arm.
Directly consuming Linaro Big Data software builds
We bring an advanced user base in the data science domain
Questions?
Contact Us:
Ganesh Raju
ganesh.raju@linaro.org
Naresh Bhat
nareshb@marvell.com
naresh.bhat@linaro.org
Blogpost
https://nbhatlinaro.blogspot.com/2019/04/apache-drill-on-arm64.html
Thanks to Linaro Team:
Yuqi Gu
Jun He
Guodong Xu
Inspiration from Felipe Hoffa’s talks on Google BigQuery
https://s3.amazonaws.com/connect.linaro.org/bkk19/presentations/bkk19-300k1.pdf
