5. Past - MapReduce & Hadoop Security developer
Howl Team
Architecture & Development: Ashutosh Chauhan, Devaraj Das, Alan Gates, Sushanth Sowmyan, Mac Yang
QE: Egil Sørensen
6. Howl Motivation
Provide a table management layer for Hadoop. This includes:
- providing a shared schema and data type system across tools (collaboration)
- providing a table abstraction so users need not worry about where or in what format their data is stored (operability)
- providing users that have different data processing tools (MR, Pig, Hive) the ability to share data (interoperability)
- providing a way to define new data storage formats, codecs, etc. (evolvability)
[Diagram: Pig, Hive, Map Reduce, and Streaming sit on top of Howl, which sits on top of RCFile, Sequence File, and Text File]
7. Logical Architecture
[Diagram: HowlLoader, HowlStorage, HowlInputFormat, HowlOutputFormat, the CLI, and Notification sit on top of the HiveMetaStoreClient (a generated Thrift client), which talks to the Hive MetaStore backed by an RDBMS. Legend: added by Howl; taken from Hive; taken from Hive and modified by Howl]
15. Data Processing
Without Howl:
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
B = filter A by bot_finder(zeta) == 0;
…
store Z into '/data/processedevents/20100819/data';
Sally must be manually informed by Joe that data is available, or must use Oozie and poll on HDFS.
With Howl:
A = load 'rawevents' using HowlLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
…
store Z into 'processedevents' using HowlStorage("date=20100819");
Oozie will be notified by Howl that data is available and can then start the Pig job.
16. Data Analysis
Without Howl, the new partition must be registered by hand before it can be queried:
alter table processedevents add partition (date = '20100819') location 'hdfs://data/processedevents/20100819/data';
select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
With Howl, the same query works as soon as the data is written:
select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
17. In summary…
- Data pipeline use case: written in some combination of Pig and MR (writes data stored in a fact/dimension model), read by Hive
- No need to export data from Pig/MR into Hive
- Tools such as Oozie are able to operate on data based on notifications provided by Howl
- Collaboration & interoperability at work!
18. Data evolvability at XYZ Corp
Let's say that XYZ Corp decides to move from text files to RCFile to store its processed data.
Without Howl:
- Pig scripts have to be changed to store in RCFile
- The Hive table has to be altered to use RCFile
- All existing data must be restated to RCFile
With Howl:
- The Howl table must be altered to use RCFile for new partitions
- Existing data need NOT be restated
- Operations can decide when to compact the data
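The per-partition behavior this slide describes can be sketched as follows. This is an illustrative model, not Howl's actual classes or API: `Table`, `add_partition`, and `formats_needed` are hypothetical names, and the point is only that each partition remembers the format it was written with, so changing the table's default affects new partitions without restating old data.

```python
# Illustrative sketch (hypothetical names, not Howl's API): a table
# records a storage format per partition, so altering the table's
# default format only affects partitions written afterwards.

class Table:
    def __init__(self, default_format):
        self.default_format = default_format
        self.partitions = {}  # partition key -> format it was written with

    def add_partition(self, key):
        # A new partition picks up the table's current default format.
        self.partitions[key] = self.default_format

    def formats_needed(self):
        # A reader must handle every format present; old data stays as-is.
        return set(self.partitions.values())

events = Table(default_format="TextFile")
events.add_partition("date=20100818")       # written as text
events.default_format = "RCFile"            # alter the table's format
events.add_partition("date=20100819")       # written as RCFile
```

Since the reader dispatches per partition, the old text partition remains readable alongside the new RCFile one until operations choose to compact it.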
19. Interfaces
- HowlInputFormat and HowlOutputFormat for MR
- HowlLoader and HowlStorage for Pig
- HowlSerDe for Hive (future)
- Command line interface that provides DDL (matches Hive DDL)
- Notification service (format TBD) (future)
- Java API for tools that need to do bulk operations (future)
20. Pig: metadata & data
[Data flow diagram – reading in Pig: HowlLoader sits on HowlInputFormat, which uses a HowlInputStorageDriver over the underlying input format X to read data from HDFS; metadata flows through the Thrift client to the Thrift server backed by the metastore]
21. Roadmap
Initial release Q1 2011:
- Table abstraction for tools processing data on Hadoop
- The ability to read and write data in Pig & Map Reduce
- The ability to read data in Hive
- Partition pruning, so that when a user asks for partitions in a table he can provide a selection predicate that determines which partitions are returned
- Integration with Hadoop security, including Howl authenticating and authorizing users
- JMX based monitoring
- Oozie workflow integration (users can submit workflows that talk to Howl)
- Support for writing data in RCFile; reading data from PigStorage, RCFile, Jute ULT (Yahoo! format) [Growl tool]
- The Hive 0.7 release will contain the Hive MetaStore related changes
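The partition-pruning item above can be illustrated with a minimal sketch. The dictionary-based partition representation and the `prune` helper are assumptions for illustration only, not Howl's API; the idea is that the metastore evaluates the user's selection predicate over partition-column values and returns only matching partitions, so non-matching data is never opened.

```python
# Illustrative sketch of partition pruning (not Howl's API): a selection
# predicate over partition-column values decides which partitions the
# metastore returns to the reading job.

partitions = [
    {"date": "20100818", "region": "us"},
    {"date": "20100819", "region": "us"},
    {"date": "20100819", "region": "eu"},
]

def prune(parts, predicate):
    # Only partitions satisfying the predicate are handed to the job.
    return [p for p in parts if predicate(p)]

selected = prune(partitions, lambda p: p["date"] == "20100819")
```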
22. Roadmap, contd.
V2 and beyond:
- Notification (for tools like Oozie)
- Dynamic partitioning
- Non-partition filter pushdowns
- Howl import/export tool (under development)
- Schema evolution
- Utilities API (for tools, e.g., a grid replication service, to use Howl easily)
- Authorization enhancements
Details at http://wiki.apache.org/pig/HowlJournal
Howl project in the Apache Incubator: starting the process.
23. Some Links
About Howl: http://wiki.apache.org/pig/Howl
Security in Howl: http://wiki.apache.org/pig/Howl/HowlAuthentication and http://wiki.apache.org/pig/Howl/HowlAuthorizationProposal
Sources: https://github.com/yahoo/howl
Roadmap: http://wiki.apache.org/pig/HowlJournal
Mailing list: howldev@yahoogroups.com
ddas@apache.org
25. Hive: data & metadata
[Data flow diagram – reading in Hive: Hive reads data from HDFS through input format X, and fetches metadata through the Thrift client from the Thrift server backed by the metastore]
26. Howl InputFormat & InputStorageDriver
HowlInputFormat:
- Fundamentally, not a data format
- A generic input format that users can use to write data-format-agnostic code
- Provides database-table-like semantics
- Allows for specifying projections and predicates
- Uses a HowlInputStorageDriver underneath
HowlInputStorageDriver:
- A wrapper over the underlying input format
- Converts the underlying record to a generic HowlRecord
HowlRecord:
- Implemented as a list of objects
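The storage-driver idea above can be sketched as follows. The driver functions and `read` helper are hypothetical illustrations, not Howl's real classes: each driver converts its underlying format's records into the same generic, position-indexed record (a plain list, mirroring HowlRecord's "list of objects"), so calling code stays format agnostic.

```python
# Illustrative sketch (hypothetical names): per-format drivers convert
# underlying records into one generic list-of-fields record, so the
# reading code does not depend on the storage format.

def text_driver(line):
    # Underlying format: tab-separated text.
    return line.rstrip("\n").split("\t")

def csv_driver(line):
    # A different underlying format, same generic output shape.
    return line.rstrip("\n").split(",")

def read(lines, driver):
    # Callers see identical generic records regardless of the driver.
    return [driver(l) for l in lines]

tsv_records = read(["a\t1\n", "b\t2\n"], text_driver)
csv_records = read(["a,1\n", "b,2\n"], csv_driver)
```

Swapping the driver changes how bytes are decoded, but not the records the caller sees, which is the property that lets one job read tables stored in different formats.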
27. Security in Howl
User (CLI) – Howl Server:
- Authentication using Kerberos
- HDFS operations are done as the authenticating user
Map/Reduce task – Howl Server:
- Authentication using Howl delegation tokens (based on Hadoop's delegation tokens)
Authorization:
- Users can control permissions & group ownership on the table
- Uses HDFS permissions to authorize metadata operations
- New partitions inherit the table's permissions and group ownership
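The inheritance rule in the last bullet can be sketched as follows. `HowlTable` and its fields are hypothetical names for illustration, not Howl's implementation: the point is that a partition's group and mode are copied from the table at creation time, so readers authorized on the table keep access to new partitions regardless of who wrote them.

```python
# Illustrative sketch (hypothetical names): a new partition copies the
# table's group ownership and permission mode, while recording its own
# creator as owner.

class HowlTable:
    def __init__(self, owner, group, mode):
        self.owner, self.group, self.mode = owner, group, mode
        self.partitions = []

    def add_partition(self, name, creator):
        # Group and mode come from the table, not from the creator,
        # so existing readers of the table keep access.
        part = {"name": name, "owner": creator,
                "group": self.group, "mode": self.mode}
        self.partitions.append(part)
        return part

t = HowlTable(owner="joe", group="analytics", mode=0o750)
p = t.add_partition("date=20100819", creator="sally")
```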