
IC-SDV 2019: Distributing AI to the Amazon Cloud - Klaus Kater (Deep SEARCH 9, Germany)


The sheer size and number of websites that DS9 processes for its customers called for a new approach: one that dynamically scales available hardware resources and bandwidth. Amazon EC2 was the platform of choice; it enables us to roll out DS9 installations as needed and execute jobs faster in the cloud. This talk presents the solution and highlights some of the challenges that had to be overcome during the implementation phase.



  1. Deep SEARCH 9: Distributing AI to the Amazon cloud
     IC-SDV 2019, 08 - 09 April, Nice, France
     Klaus Kater, Managing Partner, Deep SEARCH 9 GmbH
     https://deepsearchnine.com
     © 2019 Deep SEARCH 9 GmbH
  2. Search applications for specific scopes of information
     • Sources: Surface Web, Deep Web, Databases, Repositories
     • Deep SEARCH 9: Scheduled execution, Unattended retrieval/crawling, Prepare semantic search, Automatic publication, Ontology management
     • SEARCHCORPORA: Biotech, CROs, Digital Therapeutics, Technology Transfer Offices, Clinical trials, Other scopes of information
     • Information Consumers: Known (trusted) sources, More complete, Faster
  3. Why move to the cloud?
     DS9 needs more and more resources (timeline 2015 to 2019):
     • 30,000 company websites, link depth 3, once every 3 months, ca. 50 GB of data
     • 60,000 company websites, link depth 5, every month, ca. 1 TB of data
     • 250,000 company websites, link depth 5, twice a month, data volume: ?
     …because our search engines keep gobbling information like the Cookie Monster gobbles cookies!
  4. Therefore we need:
     • More CPU power: Content classification, Semantic tagging, Machine learning
     • Faster networks: High bandwidth requirements, Network latency problems
     • Scalability: Availability and responsiveness for users, CPU during analysis, Bandwidth during crawling
     The only place to get all of this is the cloud!
  5. More CPU power
  6. More CPU power
     • EC2 Dynamic Scaling (EC2 r5.4xlarge + 2 TB SSD): 1.22 € per hour × 8,250 hours = 10,065 € yearly budget
     • Bare metal hardware: 839.00 € per month, available all 8,760 hours of the year = 10,068 € yearly budget
  7. More CPU power
     But we need to be able to do the job in about 2 days.
     EC2 runtime compared to a bare metal server, each row at the same yearly budget of 10,065 €:
     • EC2, 10 instances (10 concurrent DS9 nodes): 2 hours/day, 69 hours/month, 825 hours/year
     • EC2, 20 instances (20 concurrent DS9 nodes): 1 hour/day, 34 hours/month, 413 hours/year
     • EC2, 50 instances (50 concurrent DS9 nodes): under 1 hour/day, 14 hours/month, 165 hours/year
     • EC2, 100 instances (100 concurrent DS9 nodes): under 1 hour/day, 7 hours/month, 83 hours/year
     Basis: EC2 r5.4xlarge + 2 TB SSD at 1.22 € per hour for 8,250 hours (10,065 € per year) versus bare metal hardware at 839.00 € per month for 8,760 hours (10,068 € per year).
     20x as much CPU for the same price!
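The budget figures on slides 6 and 7 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the prices and hour counts come from the slides, while the constant and function names are made up here, and it assumes the slide rounds halves upward (412.5 becomes 413).

```python
import math

# Figures taken from the slides; all names are illustrative assumptions.
EC2_PRICE_PER_HOUR_EUR = 1.22          # EC2 r5.4xlarge + 2 TB SSD
EC2_INSTANCE_HOURS = 8250              # instance-hours covered by the budget
BARE_METAL_PRICE_PER_MONTH_EUR = 839.00


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with .5 rounding up (412.5 -> 413)."""
    return math.floor(value + 0.5)


def yearly_budgets() -> tuple:
    """Return the (EC2, bare metal) yearly budgets in EUR."""
    ec2 = round(EC2_PRICE_PER_HOUR_EUR * EC2_INSTANCE_HOURS, 2)
    bare_metal = round(BARE_METAL_PRICE_PER_MONTH_EUR * 12, 2)
    return ec2, bare_metal


def runtime_per_instance(instances: int) -> dict:
    """Hours each instance may run if the instance-hours are split evenly."""
    per_year = EC2_INSTANCE_HOURS / instances
    return {
        "year": round_half_up(per_year),
        "month": round_half_up(per_year / 12),
        "day": round_half_up(per_year / 365),
    }


if __name__ == "__main__":
    print(yearly_budgets())
    for n in (10, 20, 50, 100):
        print(n, runtime_per_instance(n))
```

The point of the table is that the same budget buys a fixed number of instance-hours, which can be spent on one machine all year or on many machines for a short burst.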
  8. Next bullet point: Faster networks
     Viewers show the global distribution of companies in our SEARCHCORPORA.
     Obviously there is a lot of activity in Japan (JPN), India (IND), China (CHN), Korea (KOR), Hong Kong (HKG), Iran (IRN), Pakistan (PAK), Taiwan (TWN), Malaysia (MYS), Bangladesh (BGD), Singapore (SGP), …
  9. Faster networks
     Ping time from DS9 server
     Note that Tokyo and Seattle are the same distance from our servers (9,300 km), as are Boston and New Delhi (6,000 km), but network latency is much higher going east or south.
     Circles are simply squeezed to compensate for Mercator distortion.
  10. Faster networks
      But can we make the network connection faster? Simple calculation:
      • Typical page: 30 kB
      • Typical webserver: 500 pages
      From Tokyo:
      • Transferring 1 page: 1,200 ms
      • 500 pages: 500 × 1,200 ms = 10 minutes
      • 1,000 servers: 6 days 23 hours
      From London:
      • Transferring 1 page: 82 ms
      • 500 pages: 500 × 82 ms = 41 seconds
      • 1,000 servers: about 11.4 hours
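The simple calculation on this slide can be written down as a small sketch. It assumes, as the slide implicitly does, that pages are fetched strictly one after another with no parallelism; the function names are illustrative only. The results agree with the slide's figures up to rounding.

```python
def sequential_crawl_seconds(ms_per_page: float,
                             pages_per_server: int = 500,
                             servers: int = 1000) -> float:
    """Total time in seconds when every page is fetched one after another."""
    return ms_per_page * pages_per_server * servers / 1000


def as_days_hours(seconds: float) -> tuple:
    """Split a duration into whole days and the remaining (fractional) hours."""
    hours = seconds / 3600
    return int(hours // 24), hours % 24


if __name__ == "__main__":
    for origin, ms in (("Tokyo", 1200), ("London", 82)):
        one_server = sequential_crawl_seconds(ms, servers=1)
        days, hours = as_days_hours(sequential_crawl_seconds(ms))
        print(f"{origin}: {one_server:.0f} s per server, "
              f"{days} days {hours:.1f} hours for 1,000 servers")
```

Per-page transfer time dominates the total, which is why moving the crawler closer to the target servers matters more than adding bandwidth at the origin.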
  11. No. But we could distribute DS9!
      We can distribute DS9 instances across the world using the Amazon cloud.
      This map shows the Amazon EC2 data center locations.
  12. Distributing DS9
      We can distribute DS9 instances across the world using the Amazon cloud.
      This map shows the Amazon EC2 data center locations.
  13. Challenges
      Use standards or develop proprietary?
      Hadoop is what one thinks of when hearing "distributed analytics". MapReduce algorithms are good at distributing cut-down analytics tasks across multiple CPUs. This is what we would use at the filter-step level. But MapReduce is not suited to distributing whole filter chains with arbitrary analytics tasks, such as text annotation with ontologies or Deep Web crawling with URLs constructed in real time.
      How can we minimize I/O operations?
      I/O operations (especially indexing of data) and data transfer are the bottlenecks and could potentially eat up all the benefits of distribution.
      1. Data must be read only once from the DS9 backend (no copying)
      2. Data must be transferred in compressed chunks (to overcome latency issues)
      3. Data must be indexed only once at the final destination on the DS9 backend
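The second rule, transferring data in compressed chunks, can be illustrated with a minimal sketch. This is not DS9's actual wire format, which the slides do not describe: the use of JSON and zlib, the record layout, and the chunk size of 500 are all assumptions made for the example.

```python
import json
import zlib
from typing import Iterable, Iterator, List


def compressed_chunks(records: Iterable[dict],
                      chunk_size: int = 500) -> Iterator[bytes]:
    """Yield zlib-compressed JSON batches of up to chunk_size records.

    Sending a few large compressed payloads instead of many small ones
    means the round-trip latency is paid once per chunk, not once per record.
    """
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == chunk_size:
            yield zlib.compress(json.dumps(batch).encode("utf-8"))
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield zlib.compress(json.dumps(batch).encode("utf-8"))


def decode_chunk(chunk: bytes) -> List[dict]:
    """Inverse of one compressed chunk: decompress and parse the batch."""
    return json.loads(zlib.decompress(chunk).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical crawl results, only for demonstration.
    pages = [{"url": f"https://example.com/{i}", "text": "lorem ipsum " * 50}
             for i in range(1200)]
    raw_size = len(json.dumps(pages).encode("utf-8"))
    chunks = list(compressed_chunks(pages))
    print(len(chunks), "chunks,", sum(map(len, chunks)), "of", raw_size, "bytes")
```

Because the generator consumes its input lazily, each record is read once and never copied into an intermediate full-size buffer, which is in the spirit of the first rule as well.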
  14. Distributing DS9 instances
      DS9 standard node:
      • ds9App, Frontdoor (Webserver), Firewall, Browserfarm
      • DS9, MySQL, Elasticsearch, Blazegraph
      • DS9 Farming, MySQL
      • DS9 App, MySQL
      DS9 distributed node (smaller footprint!):
      • Frontdoor (Webserver), Firewall
      • DS9, MySQL, Blazegraph
  15. DS9 EC2 cloud clusters
      Two new types of DS9 jobs were implemented:
      That's what we always did
      • Execute a job from the main DS9 host remotely on some other DS9 host for load distribution
      • Execute a job from the main DS9 host on a dynamically allocated cluster of EC2 instances that have DS9 Solutions installed
      Cluster jobs:
      • Controlled by DS9 Farming
      • URLs read from DS9 main host
      • Results written back to DS9 main host
      Job settings:
      • Start 20 nodes in DS9 cluster mode
      • Use t3.xlarge node type (4 vCPUs, 96 GB)
      • Run all instances at Amazon in Tokyo
  16. Dynamic cloud allocation
      DS9 standard installation:
      • ds9App, Frontdoor (Webserver), Firewall, Browserfarm, DS9 / IDE
      • DS9, MySQL, Elasticsearch, Blazegraph
      • DS9 Farming, MySQL
      • DS9 App, MySQL
      • Accounting
      AWS Region Tokyo, 20x (deployment takes < 5 minutes):
      • DS9, MySQL, Blazegraph
      • Each node is a full installation of DS9 Solutions (without Elasticsearch)
      Instances are dynamically allocated, deployed and started; jobs are executed; and at the end all instances are terminated.
      Finally fully scalable (this satisfies our 3rd need).
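The lifecycle described on this slide (allocate, start, execute, always terminate) maps naturally onto a context manager. The sketch below is purely illustrative: `InMemoryCluster` is a hypothetical stand-in that only records state, not the AWS API and not DS9 Farming, and all names are assumptions.

```python
from contextlib import contextmanager
from typing import Iterator, List, Set


class InMemoryCluster:
    """Hypothetical stand-in for a cloud provider; it only tracks node ids."""

    def __init__(self) -> None:
        self.running: Set[str] = set()

    def start(self, count: int) -> List[str]:
        node_ids = [f"node-{i}" for i in range(count)]
        self.running.update(node_ids)
        return node_ids

    def terminate(self, node_ids: List[str]) -> None:
        self.running.difference_update(node_ids)


@contextmanager
def allocated_nodes(cluster: InMemoryCluster, count: int) -> Iterator[List[str]]:
    """Start `count` nodes and guarantee they are terminated afterwards."""
    node_ids = cluster.start(count)
    try:
        yield node_ids
    finally:
        # Runs even if the job raises, so no instance keeps running (and billing).
        cluster.terminate(node_ids)


if __name__ == "__main__":
    cluster = InMemoryCluster()
    with allocated_nodes(cluster, 20) as nodes:
        print("running job on", len(nodes), "nodes")
    print("still running:", len(cluster.running))
```

Tying termination to a `finally` clause reflects the pay-per-hour model: a failed job should not leave instances allocated.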
  17. Execute Distributed Job
      Components: DS9 Farming and several DS9 Solutions nodes (each: DS9, MySQL, Blazegraph), …
      1. export
      2. unpack
      3. start nodes
      4. import job
      5. execute job
         • remote read: equally distribute URLs among EC2 nodes
         • write cache
         • remote write
      6. stop nodes
      Diagram labels: Claim containers, input
      Only move necessary resources to EC2
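The step "equally distribute URLs among EC2 nodes" can be sketched as a round-robin partition. The slides do not say how DS9 actually splits the URL list, so the strategy and the function name below are assumptions; the sketch only shows one simple way to get partitions whose sizes differ by at most one.

```python
from typing import List, Sequence


def partition_urls(urls: Sequence[str], node_count: int) -> List[List[str]]:
    """Split urls into node_count round-robin partitions of near-equal size."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Slice i takes every node_count-th URL starting at index i.
    return [list(urls[i::node_count]) for i in range(node_count)]


if __name__ == "__main__":
    # Hypothetical URL list, only for demonstration.
    urls = [f"https://example.com/company/{i}" for i in range(103)]
    parts = partition_urls(urls, 20)
    print([len(p) for p in parts])
```

A real crawler might prefer to keep all URLs of one host on the same node, for politeness limits and connection reuse; round-robin is used here only because it is the simplest equal split.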
  18. Managed Intelligence 2018
      • Sources: Surface Web, Deep Web, Databases, Repositories
      • Information Scientists (Search Competence Center): Information source selection, Content structuring, Linking of disparate sources, Ontology management, SEARCHCORPUS® management
      • Expertise of information scientist needed
      • Unattended automatic execution of jobs: Scheduled execution, Unattended updates, Prepare semantic search, Ontology management, Automatic publication
      • SEARCHCORPORA: Start-ups, Competitors, Regulatory, New technology, …
      • Information Consumers (Internal Customers): Known (trusted) sources, More complete, Faster
  19. Managed Intelligence 2019: Fully distributed
      • Sources: Surface / Deep Web; Company repositories, e.g. Crunchbase
      • Crawling and automatic classification for classes of interest; Classified targets
      • Master SEARCHCORPUS®: Hundreds of thousands of websites, Many million web pages, PDF-based publications, Structured data, Other sources
      • Extraction using Lucene query + classification
      • Qualification SEARCHCORPUS®; Quality assurance
      • SEARCHCORPORA (customize for research target): Biotech, CROs, Digital Therapeutics, Technology Transfer Offices, Clinical trials, Other scopes of information
      • Automatic publication: target specific focus
      • Information Scientists (Search Competence Center): Expertise of information scientist needed
      • Crawling: Distributed automatic execution of jobs
      • Unattended automatic execution of jobs
      • Information Consumers (Internal Customers)
  20. Deep SEARCH 9: Distributing AI to the Amazon cloud
      IC-SDV 2019, 08 - 09 April, Nice, France
      Klaus Kater, Managing Partner, Deep SEARCH 9 GmbH
      https://deepsearchnine.com
