Introduction to HCFS

Introduction to Hadoop Compatible File System
2017-01-21 Hadoop.TW & GCPUG.TW Meetup #1 2017

  1. 1. A First Look at HCFS: Introduction to Hadoop Compatible File System. Jazz Yao-Tsung Wang, Co-founder of Hadoop.TW ( https://fb.com/groups/hadoop.tw ). 2017-01-21, Hadoop.TW & GCPUG.TW Meetup #1 2017
  2. 2. HELLO! I am Jazz Wang, Co-Founder of Hadoop.TW. Hadoop Evangelist since 2008. Open Source Promoter. System Admin (Ops). You can find me at @jazzwang_tw , https://fb.com/groups/hadoop.tw or https://forum.hadoop.tw
  3. 3. 1. What is HCFS? Let’s start with a brief introduction to Apache Hadoop
  4. 4. Apache Hadoop from 0.x to 1.x. [Diagram: one Master and Worker #1 to #3. Computation layer (MapReduce): one Job Tracker and four Task Trackers. Storage layer (HDFS): one NameNode and four DataNodes.]
  5. 5. Apache Hadoop from 2.x to 3.x. [Diagram: one Master and Worker #1 to #3. Computation layer (YARN): one Resource Manager and four Node Managers, with Containers. Storage layer (HDFS): one NameNode and four DataNodes.]
  6. 6. Needs / Trends: Hadoop on the Cloud http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
  7. 7. Why Hadoop on the Cloud? http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production https://www.youtube.com/watch?v=XehH3iJJy3Q
  8. 8. Why might you need HCFS ... https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086&comment_tracking={%22tn%22%3A%22R%22}
  9. 9. http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production https://www.youtube.com/watch?v=XehH3iJJy3Q Spark / Hive / Impala ...
  10. 10. Docker, Microservice, Serverless, NoOps !?! $$$ https://aws.amazon.com/lambda/ https://cloud.google.com/functions/ http://www.forbes.com/sites/janakirammsv/2016/02/09/google-brings-serverless-computing-to-its-cloud-platform/#76e1aa9425b8
  11. 11. What is HCFS? Hadoop Compatible File System. [Diagram: one Master and Worker #1 to #3. Computation layer (YARN): one Resource Manager and four Node Managers. Storage layer (HCFS): Windows Azure Blob, AWS S3, Google Cloud Storage, CephFS.]
  12. 12. HCFS implementations: Cloud Storage Connectors ( for Public Cloud Providers ) https://wiki.apache.org/hadoop/HCFS
      ▸ AWS S3 https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/
        ▹ s3:// Hadoop 0.10 ~ Hadoop 2.7
        ▹ s3n:// Hadoop 0.18 ~ Hadoop 2.6
        ▹ s3a:// Hadoop 2.7+
      ▸ AWS EMRFS: ?? ( 3rd party ) http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
      ▸ Windows Azure Storage Blob: wasb:// Hadoop 2.7+ http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/ https://issues.apache.org/jira/browse/HADOOP-9629
      ▸ Azure Data Lake: adl:// Hadoop 3.0+ https://hadoop.apache.org/docs/current/hadoop-azure-datalake/ https://docs.microsoft.com/zh-tw/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
      ▸ Google Cloud Storage: gs:// ( 3rd party ) Hadoop 1.x, Hadoop 2.x https://cloud.google.com/hadoop/google-cloud-storage-connector https://github.com/GoogleCloudPlatform/bigdata-interop
  13. 13. HCFS implementations ( for Private Cloud Providers )
      ▸ OpenStack Swift ( rackspace ): swift:// Hadoop 2.7+ https://issues.apache.org/jira/browse/HADOOP-8545 http://hadoop.apache.org/docs/r2.7.3/hadoop-openstack/ https://github.com/steveloughran/Hadoop-and-Swift-integration/
      ▸ CephFS ( OpenStack ): ceph:// ( 3rd party ) Hadoop 1.1.x http://docs.ceph.com/docs/master/cephfs/hadoop/ https://github.com/houbin/cephfs-hadoop
      ▸ Cassandra File System: cfs:// ( 3rd party ) http://www.datastax.com/dev/blog/cassandra-file-system-design http://www.datastax.com/resources/whitepapers/hdfs-vs-cfs
      ▸ GlusterFS: glusterfs:/// ( 3rd party ) https://github.com/gluster/glusterfs-hadoop https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Hadoop/
      ▸ OrangeFS ( 3rd party ): Hadoop 1.2.1, Hadoop 2.6.0 http://docs.orangefs.com/v_2_8_8/index.htm#Hadoop_Client.htm http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
      ▸ QFS ( KFS ): qfs:// ( 3rd party ) https://github.com/quantcast/qfs/wiki/Migration-Guide
      ▸ Lustre ( 3rd party ) http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
      ▸ MapR File System ( 3rd party ) https://www.mapr.com/products/mapr-fs https://community.mapr.com/thread/7027
  14. 14. HCFS Architecture http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production https://www.youtube.com/watch?v=XehH3iJJy3Q New API
  15. 15. https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/51169 http://www.slideshare.net/jazzwang/hadoop-69818883
  16. 16. https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/51169 http://www.slideshare.net/jazzwang/hadoop-69818883
      ▸ AWS S3: authentication support
      ▸ Azure Blob: supports an encrypted key
      ▸ CephFS does not work well with YARN because of JNI (Java Native Interface) :(
      ▸ Only HDFS and Azure Blob support HBase !!
  17. 17. 2. AWS S3 Use Case : Amazon EMR
  18. 18. Three generations of S3 support https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
      ▸ s3:// : the ‘classic’ s3: filesystem.
        ▹ Introduced in Hadoop 0.10.0 (HADOOP-574); deprecated and will be removed from Hadoop 3.0.
        ▹ Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
        ▹ core-site.xml:
            <property><name>fs.s3.awsAccessKeyId</name><value>AWS access key ID</value></property>
            <property><name>fs.s3.awsSecretAccessKey</name><value>AWS secret key</value></property>
      ▸ s3n:// : the second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store.
        ▹ Introduced in Hadoop 0.18.0 (HADOOP-930); rename support in Hadoop 0.19.0 (HADOOP-3361); for Hadoop 2.6 and earlier.
        ▹ Requires a compatible version of jets3t.
        ▹ core-site.xml:
            <property><name>fs.s3n.awsAccessKeyId</name><value>AWS access key ID</value></property>
            <property><name>fs.s3n.awsSecretAccessKey</name><value>AWS secret key</value></property>
      ▸ s3a:// : the third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance.
        ▹ Introduced in Hadoop 2.6.0 (HADOOP-11571); recommended for Hadoop 2.7 and later.
        ▹ Requires an exact version of amazon-aws-sdk.
        ▹ core-site.xml:
            <property><name>fs.s3a.access.key</name><value>AWS access key ID</value></property>
            <property><name>fs.s3a.secret.key</name><value>AWS secret key</value></property>
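The s3a:// properties on this slide can be written into a file with a few lines of shell. This is a minimal sketch, not something from the talk: the helper name `write_s3a_core_site` is hypothetical, and reading the values from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables is an assumption made for the example. Only the two property names come from the slide.

```shell
# Sketch: write a minimal core-site.xml for s3a:// into a directory.
# write_s3a_core_site is a hypothetical helper name, not a Hadoop tool.
# Assumption: the keys are taken from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# and contain no characters that need XML escaping.
write_s3a_core_site() {
  conf_dir=$1
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
    return 1
  fi
  mkdir -p "$conf_dir" || return 1
  (
    # The file holds a secret, so create it readable by the owner only.
    umask 077
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>${AWS_ACCESS_KEY_ID}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>${AWS_SECRET_ACCESS_KEY}</value>
  </property>
</configuration>
EOF
  )
}
```

After calling `write_s3a_core_site "$HOME/hcfs-conf"`, pointing `HADOOP_CONF_DIR` at that directory should let `hadoop fs -ls s3a://${bucket}/` pick the keys up, assuming the hadoop-aws jars are on the classpath.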
  19. 19. WARNING!! https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
      1. You cannot use S3 as a replacement for HDFS.
      2. Amazon S3 is an "object store":
         ▸ eventual consistency
         ▸ non-atomic rename and delete operations
      3. Your AWS credentials are valuable:
         ▸ core-site.xml is readable cluster-wide
         ▸ Don’t embed the credentials in the URI
         ▸ S3A supports more authentication mechanisms
      4. Amazon's EMR service is based upon Apache Hadoop, but contains modifications and its own proprietary S3 client.
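One way to act on the "credentials in the URI" warning is to reject such URIs before they reach a command line or a log file. The following is only a sketch: `has_embedded_credentials` and `check_uri` are hypothetical helper names, and the check is a plain pattern match on the authority part of the URI, not anything Hadoop itself provides.

```shell
# Sketch: flag URIs of the form scheme://KEY:SECRET@bucket/path.
# has_embedded_credentials and check_uri are hypothetical helpers.
has_embedded_credentials() {
  # Drop the scheme, then keep only the authority (text before the first "/").
  authority=${1#*://}
  authority=${authority%%/*}
  case $authority in
    *:*@*) return 0 ;;   # "key:secret@host" form
    *)     return 1 ;;
  esac
}

# Echo the URI if it looks safe; otherwise warn on stderr and fail.
check_uri() {
  if has_embedded_credentials "$1"; then
    echo "refusing URI with embedded credentials" >&2
    return 1
  fi
  echo "$1"
}
```

A URI such as `wasb://yourcontainer@youraccount.blob.core.windows.net/testDir` passes, because its authority contains no `key:secret` pair. A secret key that itself contains a `/` would not be detected by this simple check.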
  20. 20. DEMO https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
      ▸ For Mac OS X ( Homebrew ):
          brew install hadoop
          export HADOOP_CONF_DIR=/path/to/conf   # directory that contains core-site.xml
          export HADOOP_CLASSPATH="/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*"
          hadoop fs -ls s3n://${bucket}/
      ▸ For Linux / Windows, use the BigTop docker image:
          docker run -it --name hcfs -h hcfs -v "$(pwd)":/data jazzwang/bigtop-hdfs
          # cd /data
          /data# export HADOOP_CONF_DIR=/path/to/conf   # directory that contains core-site.xml
          /data# hadoop fs -ls s3n://${bucket}/
  21. 21. Undocumented Secrets: debugging and workaround tricks
      ▸ To enable more log4j messages, you could try:
          export HADOOP_ROOT_LOGGER=DEBUG,console
          hadoop fs -ls s3n://${bucket}/
      ▸ To access unofficial ( non-AWS ) S3 services such as hicloud S3 and Ceph S3 (RGW):
        ▹ Using s3n:// , you have to provide a jets3t.properties config file:
            $ cat jets3t.properties
            s3service.s3-endpoint=s3.hicloud.net
            s3service.https-only=false
        ▹ Using s3a:// , you could add the following to core-site.xml:
            <property>
              <name>fs.s3a.endpoint</name>
              <value>s3.hicloud.net</value>
              <description>default is s3.amazonaws.com</description>
            </property>
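For a one-off test against an S3-compatible endpoint, the same `fs.s3a.endpoint` property can probably be passed per command with Hadoop's generic `-D` option instead of editing core-site.xml. The sketch below only prints the command so it can be inspected first. `s3a_ls_cmd` is a hypothetical helper, and whether `-D` overrides are honored this way on a given Hadoop version is an assumption to verify.

```shell
# Sketch: compose (but do not run) an s3a listing command for a custom endpoint.
# s3a_ls_cmd is a hypothetical helper name.
# usage: s3a_ls_cmd ENDPOINT BUCKET [debug]
s3a_ls_cmd() {
  endpoint=$1
  bucket=$2
  prefix=
  if [ "${3:-}" = debug ]; then
    # Same logger setting as on the slide, applied to this one command.
    prefix='HADOOP_ROOT_LOGGER=DEBUG,console '
  fi
  printf '%shadoop fs -D fs.s3a.endpoint=%s -ls s3a://%s/\n' \
    "$prefix" "$endpoint" "$bucket"
}
```

For example, `s3a_ls_cmd s3.hicloud.net mybucket debug` prints a command line that combines the debug logger and the endpoint override.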
  22. 22. 3. Windows Azure Storage Blob Use Case : HDInsight / Azure Data Lake
  23. 23. Hadoop Azure Support: Azure Blob Storage http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
      1. hadoop-azure.jar is located at
         - /usr/lib/hadoop-mapreduce/hadoop-azure.jar ( bigtop , CDH )
         - ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar ( official tar.gz , Mac brew )
      2. Depends on the Azure Storage SDK for Java
         - https://github.com/Azure/azure-storage-java
      3. Features
         ▸ Supports configuration of multiple Azure Blob Storage accounts.
         ▸ Supports both page blobs and block blobs.
         ▸ wasbs:// scheme for SSL encrypted access.
         ▸ Can act as a source of data in a MapReduce job, or a sink.
         ▸ Tested on both Linux and Windows.
      4. Limitations
         ▸ The append operation is not implemented.
         ▸ File owner and group are persisted, but the permissions model is not enforced.
         ▸ File last access time is not tracked.
  24. 24. Configurations http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
      In core-site.xml:
          <property>
            <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
            <value>YOUR ACCESS KEY</value>
          </property>
      Examples:
          > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
          > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
          > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
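The URI and property-name patterns in these examples are regular enough to generate. As a small sketch (the helper names `wasb_uri` and `wasb_key_property` are hypothetical, and the `blob.core.windows.net` suffix is taken from the examples above rather than from any configuration lookup):

```shell
# Sketch: build wasb:// or wasbs:// URIs and the matching account-key property name.
# usage: wasb_uri CONTAINER ACCOUNT PATH [SCHEME]   (SCHEME defaults to wasb)
wasb_uri() {
  scheme=${4:-wasb}
  case $scheme in
    wasb|wasbs) ;;
    *) echo "unsupported scheme: $scheme" >&2; return 1 ;;
  esac
  rel=${3#/}   # tolerate a leading slash in PATH
  printf '%s://%s@%s.blob.core.windows.net/%s\n' "$scheme" "$1" "$2" "$rel"
}

# usage: wasb_key_property ACCOUNT
wasb_key_property() {
  printf 'fs.azure.account.key.%s.blob.core.windows.net\n' "$1"
}
```

With these, the third example above becomes `hadoop fs -cat "$(wasb_uri yourcontainer youraccount testDir/testFile wasbs)"`.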
  25. 25. My Use Case: rsync between local storage and wasb http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
      Take advantage of hadoop distcp:
      ▸ Backup
          hadoop distcp -update ${SOURCE_DIR} wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR}
      ▸ Restore
          hadoop distcp wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR} ${RESTORE_DIR}
      Use Hadoop as an rsync-like tool to sync with Hybrid Cloud Storage
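The backup and restore commands can be wrapped so that they are previewed before touching any data. This is a sketch under stated assumptions: `WASB_BASE`, `DRY_RUN`, `run_or_print`, `backup` and `restore` are all hypothetical names introduced here, and only the two `hadoop distcp` invocations come from the slide.

```shell
# Sketch: dry-run wrapper around the distcp backup / restore commands.
# All names below are hypothetical; adjust WASB_BASE to your own account.
WASB_BASE="wasb://yourcontainer@youraccount.blob.core.windows.net"

# Print the command when DRY_RUN=1, otherwise execute it.
run_or_print() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: backup SOURCE_DIR BACKUP_DIR
backup() {
  run_or_print hadoop distcp -update "$1" "$WASB_BASE/$2"
}

# usage: restore BACKUP_DIR RESTORE_DIR
restore() {
  run_or_print hadoop distcp "$WASB_BASE/$1" "$2"
}
```

Setting `DRY_RUN=1` and calling `backup /data/logs backup/logs` prints the full distcp command without running it; unsetting `DRY_RUN` runs it for real.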
  26. 26. Use Case in TenMax: Read / Write files from/to Azure Blob Storage. [Diagram labels: Spring Boot, Web Application, FileSystem, File System Abstraction Layer, core-site.xml, Azure Blob Storage, Cloud Storage.] Use Hadoop as a Java Library to access Hybrid Cloud Storage
  27. 27. 4. Ceph
  28. 28. High Level Architecture of Hadoop 2.x with CephFS. [Diagram: one Master and Worker #1 to #3. Computation layer (YARN): one Resource Manager and four Node Managers. Storage layer (Ceph): three Mon and four OSD.]
  29. 29. [Diagram labels: hdfs01, hdfs02, hdfs03, hdfs04 and node11, node21, node31 on a virtual network ( hub ); one Ceph mon, four Ceph OSD, one Resource Manager, three Node Managers.]
  30. 30. CephFS installation http://docs.ceph.com/docs/master/cephfs/hadoop/ https://github.com/ceph/cephfs-hadoop
      1. Compile https://github.com/ceph/cephfs-hadoop
      2. Copy cephfs-hadoop.jar and place it at ${HADOOP_HOME}/lib/
      3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
      4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
      5. Copy JNI related files to ${HADOOP_HOME}/lib/native/
           ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
           ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
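Steps 4 and 5 can be scripted so that the same layout is reproduced on every node. The sketch below is an illustration only: `install_cephfs_bindings` is a hypothetical helper, and it assumes that `cephfs-java.jar`, `libcephfs.so.1` and `libcephfs_jni.so.1` have already been collected in one source directory.

```shell
# Sketch of steps 4-5: copy the Java binding and the JNI libraries, then
# create the unversioned symlinks next to them.
# install_cephfs_bindings is a hypothetical helper; all paths are arguments.
# usage: install_cephfs_bindings SRC_DIR HADOOP_LIB_DIR
install_cephfs_bindings() {
  src_dir=$1                      # holds cephfs-java.jar and the *.so.1 files
  hadoop_lib=$2                   # e.g. ${HADOOP_HOME}/lib
  native_dir=$hadoop_lib/native

  mkdir -p "$native_dir" || return 1
  cp "$src_dir/cephfs-java.jar" "$hadoop_lib/" || return 1
  for lib in libcephfs libcephfs_jni; do
    cp "$src_dir/$lib.so.1" "$native_dir/" || return 1
    # Relative link target, as in the ln -s commands on the slide.
    ln -sf "$lib.so.1" "$native_dir/$lib.so" || return 1
  done
}
```

Because the link targets are relative, the versioned `.so.1` files must sit in the same `native` directory as the links, which is why the function copies them there first.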
  31. 31. Known Issue : MRAppMaster cannot find cephfs_jni
  32. 32. Root Cause : There is no -Djava.library.path for MRAppMaster
  33. 33. Root Cause : There is no -Djava.library.path for MRAppMaster
  34. 34. G.G Official Support is limited to Hadoop 1.1.x http://docs.ceph.com/docs/master/cephfs/hadoop/
  35. 35. Why does it work for MRv1?? Let’s take a look at MapReduce v1 Architecture
  36. 36. Why doesn’t it work on YARN?? Let’s take a look at YARN Architecture
  37. 37. Without correct configuration, an HCFS or YARN application that uses JNI will fail :( http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
  38. 38. How to solve this issue? The official documentation and source code say so ...
      WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
      https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
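Following that warning, one thing to try is setting `LD_LIBRARY_PATH` through environment properties rather than `-Djava.library.path`. The sketch below prints mapred-site.xml property elements for that. It is untested against CephFS: `ld_path_properties` is a hypothetical helper, `mapreduce.admin.user.env` is the property named in the warning, and adding `yarn.app.mapreduce.am.env` for the MRAppMaster process is an assumption of this example rather than something stated in the slides.

```shell
# Sketch: emit mapred-site.xml properties that set LD_LIBRARY_PATH for
# map/reduce tasks and (by assumption) for the MR application master.
# ld_path_properties is a hypothetical helper name.
# Caution: these properties may replace their defaults rather than extend
# them, so include Hadoop's own native directory if tasks still need it.
# usage: ld_path_properties NATIVE_DIR
ld_path_properties() {
  native_dir=$1
  for name in mapreduce.admin.user.env yarn.app.mapreduce.am.env; do
    cat <<EOF
<property>
  <name>$name</name>
  <value>LD_LIBRARY_PATH=$native_dir</value>
</property>
EOF
  done
}
```

Running `ld_path_properties /usr/lib/hadoop/lib/native` prints two property elements that can be pasted inside the `<configuration>` element of mapred-site.xml.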
  39. 39. Conclusion
      ▸ S3 and WASB are the most mature HCFS implementations.
        ▹ Sorry that I’m not sure about Google Cloud Storage :(
      ▸ You’ll need more integration tests for the Hadoop ecosystem when using HCFS.
      ▸ Use Hadoop as an rsync-like tool to sync with Hybrid Cloud Storage.
      ▸ Use Hadoop as a Java Library to access Hybrid Cloud Storage.
  40. 40. THANKS! Any questions? You can find me at @jazzwang_tw & https://fb.com/groups/hadoop.tw
  41. 41. CREDITS Special thanks to all the people who made and released these awesome resources for free: ▸ Presentation template by SlidesCarnival ▸ Photographs by Death to the Stock Photo (license) PRESENTATION DESIGN This presentation uses the following typographies and colors: ▸ Titles: Montserrat ▸ Body copy: Karla You can download the fonts on this page: http://www.google.com/fonts/#UsePlace:use/Collection:Montserrat:400,700|Karla:400,400italic,700,700italic