Hadoop Record Reader in Python HUG: Nov 18 2009 Paul Tarjan http://paulisageek.com @ptarjan http://github.com/ptarjan/hadoop_record
Hey Jute… Tabs and newlines are good and all. For lots of data, don’t do that.
don’t make it bad... Hadoop has a native data storage format called Hadoop Record, or “Jute” (org.apache.hadoop.record). http://en.wikipedia.org/wiki/Jute
take a data structure… There is a Data Definition Language! module links { class Link { ustring URL; boolean isRelative; ustring anchorText; }; }
and make it better… And a compiler: $ rcc -lc++ inclrec.jr testrec.jr namespace inclrec { class RI : public hadoop::Record { private: int32_t I32; double D; std::string S;
remember, to only use C++/Java: $ rcc --help Usage: rcc --language [java|c++] ddl-files
then you can start to make it better… I wanted it in Python. Need 2 parts: a parsing library and a DDL translator. I only did the first part. If you need the second part, let me know.
Hey Jute don't be afraid…
you were made to go out and get her… http://github.com/ptarjan/hadoop_record
the minute you let her under your skin… I bet you thought I was done with “Hey Jude” references, eh? How I built it: Ply == lex and yacc. Parser == 234 lines including tests! Outputs generic data types. You have to do the class transform yourself. You can use my lex and yacc stuff in your language of choice.
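One way to do the class transform by hand, as a minimal sketch. It assumes the parser hands back each record as a flat sequence of already-typed values in DDL field order (an assumption, not something the slides spell out). The `Link` fields come from the DDL slide, while `to_record` and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Link:
    # Mirrors the Link class from the DDL slide: ustring, boolean, ustring.
    URL: str
    isRelative: bool
    anchorText: str


def to_record(cls, values):
    """Build a dataclass instance from a flat sequence of field values.

    Hypothetical helper: values are matched to fields by position,
    so they must arrive in the same order as the DDL declares them.
    """
    names = [f.name for f in fields(cls)]
    if len(values) != len(names):
        raise ValueError(
            f"{cls.__name__} expects {len(names)} fields, got {len(values)}"
        )
    return cls(**dict(zip(names, values)))


# Made-up generic output for one Link record (assumed shape).
generic = ("http://example.com/", False, "home")
link = to_record(Link, generic)
print(link.anchorText)  # home
```

Matching by position keeps the helper independent of any one record type, at the cost of silently trusting the field order.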
and any time you feel the pain… Parsing the binary format is hard. Vector vs struct??? struct = "s{" record *("," record) "}" vector = "v{" [record *("," record)] "}" LazyString – don’t decode if not needed. 99% of my Hadoop time was decoding strings I didn’t need. Binary on disk -> CSV -> Python == wasteful. Hadoop unpacks zip files – name it .mod
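The struct and vector rules above can be sketched as a small recursive-descent parser with lazily decoded strings. This is not the Ply-based parser from the repository, only an illustration: `parse_record`, `parse_items`, and this `LazyString` are hypothetical names. It also assumes (from my recollection of the CSV serialization, not from the slides) that string fields start with `'` and escape special characters as `%XX`. All other scalars are left as raw text.

```python
from urllib.parse import unquote


class LazyString:
    """Keeps the raw token and decodes it only when str() is first called."""

    def __init__(self, raw):
        self._raw = raw
        self._decoded = None

    def __str__(self):
        if self._decoded is None:
            # Assumption: special characters are %XX-escaped in the text form.
            self._decoded = unquote(self._raw)
        return self._decoded


def parse_record(text, pos=0):
    """Parse one record at text[pos]; return (value, next_position)."""
    if text.startswith("s{", pos):
        items, pos = parse_items(text, pos + 2)
        if not items:
            # struct = "s{" record *("," record) "}" needs at least one record.
            raise ValueError("empty struct")
        return tuple(items), pos
    if text.startswith("v{", pos):
        # vector = "v{" [record *("," record)] "}" may be empty.
        return parse_items(text, pos + 2)
    # Scalar token: everything up to the next ',' or '}'.
    end = pos
    while end < len(text) and text[end] not in ",}":
        end += 1
    token = text[pos:end]
    if token.startswith("'"):
        return LazyString(token[1:]), end  # assumed string marker
    return token, end


def parse_items(text, pos):
    """Parse [record *("," record)] plus the closing '}'."""
    items = []
    if pos < len(text) and text[pos] == "}":
        return items, pos + 1
    while True:
        value, pos = parse_record(text, pos)
        items.append(value)
        if pos >= len(text):
            raise ValueError("missing closing '}'")
        if text[pos] == "}":
            return items, pos + 1
        if text[pos] != ",":
            raise ValueError(f"unexpected {text[pos]!r} at offset {pos}")
        pos += 1


line = "s{'http%3A//a%2Cb,T,v{1,2},v{}}"
record, end = parse_record(line)
print(record[2])       # ['1', '2']
print(str(record[0]))  # http://a,b  (decoded only at this point)
```

Structs come back as tuples and vectors as lists, so the two stay distinguishable after parsing; the string in the first field costs nothing unless something actually asks for it.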
nanananana Future work: DDL converter. Integrate it officially. Record writer (should be easy). SequenceFileAsOutputFormat. Integrate your feedback.
