Automatic Scheduled Loading of CCK Nodes
ETL with drupal_execute, OO, drush, & cron

David Naughton | December 3, 2008
Who am I?
David Naughton

• Web Applications Developer
   • University of Minnesota Libraries
   • naughton@umn.edu
• 11+ years development experience
• New to Drupal & PHP
What's EthicShare?
ethicshare.org

• Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE

• What: A sustainable aggregation of bioethics research and a forum for scholarship

• When: Pilot Phase January 2008 – June 2009

• How: Funded by Andrew W. Mellon Foundation
Sustainable Aggregation of Bioethics Research
• My part of the project

• Extract citations from multiple sources

• Transform into Drupal-compatible format

• Load into Drupal

• On a regular, ongoing basis
ETL...
• Extract, Transform, and Load = ETL
• Very common IT problem
• ETL is the most common term for it
• Librarians like to say...
   • “Harvesting” instead of Extracting
   • “Crosswalking” instead of Transforming
• ...but they're peculiar
...ETL
• Complex problem
• Lots of packaged solutions
   • Mostly Java, for data warehouses
• Not a good fit for EthicShare
   • Using Drupal 5 and CCK
   • No Batch API
• When we move to Drupal 6...
   • Batch API http://bit.ly/BatchAPI?
   • content.crud.inc http://bit.ly/content-crud-inc?
Without Automation
• First PubMed load alone was > 100,000 citations
• Without automation, I could have been doing lots of this:
One Solution
If money were no object, we could have hired lots of these:
Really want...
...but don't want:
Architecture
drush
CiteETL

 Extractors               Transformers
 PubMed          → XML →  PubMed          ┐
 WorldCat        → XML →  WorldCat        ├→ PHP Array → Loader → EthicShare MySQL
 New York Times  → XML →  New York Times  │
 BBC             → XML →  BBC             ┘
drush
A portmanteau of “Drupal shell”.

“…a command line shell and Unix scripting interface for
Drupal, a veritable Swiss Army knife designed to make
life easier for those of us who spend most of our
working hours hacking away at the command prompt.”

 -- http://drupal.org/project/drush
Why drush?
• Very flexible scheduling via cron
• Uses php-cli, so no web timeouts
• Experimental support for running drush without a running Drupal web instance
• Run tests from the cli with Drush simpletest runner
Why not hook_cron?
• If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work
• Subject to web timeouts
• Runs within a Drupal web instance, so large loads may affect user experience
drush help
$ cd $drush_dir
$ ./drush.php help
Usage: drush.php [options] <command> <command> ...

Options:
  -r <path>, --root=<path>        Drupal root directory to use
                                  (default: current directory)

 -l <uri> , --uri=<uri>           URI of the drupal site to use (only
                                  needed in multisite environments)
...

Commands:
  cite load               Load data to create new citations.

 help                     View help. Run "drush help [command]" to view
                          command-specific help.

 pm install               Install one or more modules
drush command help
$ ./drush.php help cite load
Usage: drush.php cite load [options]

Options:
  --E=<extractor class>       Base name of an extractor class, excluding
                              the CiteETL/E/ parent path & '.php'. Required.

 --T=<transformer class>      Base name of a transformer class, excluding
                              the CiteETL/T/ parent path & '.php'. Required.

 --L=<loader class>           Base name of a loader class, excluding the
                              CiteETL/L/ parent path & '.php'. Optional:
                              default is 'Loader'.

 --dbuser=<db username>       Optional: 'cite load' will authenticate the
                              user only if both dbuser & dbpass are present.

 --dbpass=<db password>       Optional: 'cite load' will authenticate the
                              user only if both dbuser & dbpass are present.

 --memory_limit=<memory limit>         Optional: default is 512M.
drush cite load
Example specifying the New York Times – Health extractor & transformer classes on the cli:

$ ./drush.php cite load --E=NYTHealth \
  --T=NYTHealth --dbuser=$dbuser \
  --dbpass=$dbpass

Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
php-cli Problems
• PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: Memory Leaks With Objects in PHP 5 http://bit.ly/php5-memory-leak
• Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
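The circular-reference problem can be sketched in a few lines. This is only an illustration: the `SketchNode` class and its `destroy()` method are hypothetical names, not part of the EthicShare code. PHP 5.3 added a cycle collector, reachable through `gc_enable()` and `gc_collect_cycles()`; on older versions the only remedy is to break the cycle by hand, which is what `destroy()` does.

```php
<?php
// Hypothetical class for illustration: a parent and its children
// reference each other, which forms a reference cycle.
class SketchNode {
    public $parent = NULL;
    public $children = array();

    function addChild(SketchNode $child) {
        $child->parent = $this;
        $this->children[] = $child;
    }

    // Break the cycle by hand so that plain reference counting can
    // free the objects (the only option before PHP 5.3).
    function destroy() {
        foreach ($this->children as $child) {
            $child->destroy();
        }
        $this->parent = NULL;
        $this->children = array();
    }
}

// PHP >= 5.3 only: make sure the cycle collector is on and idle.
gc_enable();
gc_collect_cycles();

// Build a cycle and drop the last outside reference to it without
// calling destroy(). Reference counting alone cannot free it.
$root = new SketchNode();
$root->addChild(new SketchNode());
unset($root);

// Ask the collector to reclaim the orphaned cycle. The return value
// is the number of cycles it collected.
$collected = gc_collect_cycles();
```

In a long-running load, calling something like `destroy()` on each parsed record after it has been loaded keeps memory flat even on PHP versions without the collector.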
drush API
Undocumented, but simple, and http://drupal.org/project/drush links to some modules that use it. To create a drush command…
• Implement hook_drush_command, mapping cli text to a callback function name
• Implement the callback function
…and optionally…
• Implement a hook_help case for your command
drush getopt emulation…
Supports:
• --opt=value
• -opt or --opt (boolean based on presence or absence)
Contrary to README.txt, does not support:
• -opt value
• -opt=value
…drush getopt emulation
• Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options']
• Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands']
Quirks:
• in cases of repetition (e.g. -opt --opt=value ), last one wins
• commands & options can be interspersed, as long as order of commands is maintained
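The parsing rules above can be mimicked with a short function. This is a sketch of the described behaviour only, not drush's own parser: `sketch_parse_args` is a hypothetical name, and silently ignoring `-opt=value` is an assumption, since the slides say only that the form is unsupported.

```php
<?php
// Hypothetical helper that mimics the option handling described on
// the slides. It is not taken from drush itself.
function sketch_parse_args(array $words) {
    $args = array('options' => array(), 'commands' => array());
    foreach ($words as $word) {
        if (preg_match('/^--([^=]+)=(.*)$/', $word, $m)) {
            // --opt=value
            $args['options'][$m[1]] = $m[2];
        } elseif (preg_match('/^--?([^=]+)$/', $word, $m)) {
            // -opt or --opt: boolean, TRUE when present
            $args['options'][$m[1]] = TRUE;
        } elseif (strpos($word, '-') !== 0) {
            // Any word not starting with a dash is a command, so the
            // "value" in "-opt value" ends up here, not in options.
            $args['commands'][] = $word;
        }
        // Assumption: "-opt=value" is dropped without an error.
    }
    return $args;
}

$parsed = sketch_parse_args(
    array('cite', '--E=NYTHealth', 'load', '-v', '--v=2')
);
```

Because later assignments overwrite earlier ones, `$parsed['options']['v']` ends up holding the string from `--v=2`, which matches the "last one wins" quirk, and the commands keep their relative order even though an option sits between them.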
cite.module example…
function cite_drush_command() {
    $items['cite load'] = array(
     'callback'    => 'cite_load_cmd',
     'description' => t('Load data to create new citations.')
    );
    return $items;
}
…cite.module example…
function cite_load_cmd($url) {

   global $args;
   $options = $args['options'];

   // Batch loading will often require more
   // than the default memory.
   $memory_limit = (
       array_key_exists('memory_limit', $options)
       ? $options['memory_limit']
       : '512M'
   );
   ini_set('memory_limit', $memory_limit);

   // continued on next slide…
…cite.module example
 // …continued from previous slide

   if (array_key_exists('dbuser', $options)
       && array_key_exists('dbpass', $options)) {
       user_authenticate($options['dbuser'], $options['dbpass']);
   }

   set_include_path(
      './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
      . './' . drupal_get_path('module', 'cite') . '/contrib'
      . PATH_SEPARATOR . get_include_path()
   );

   require_once 'CiteETL.php';
   $etl = new CiteETL( $options );
   $etl->run();

} // end function cite_load_cmd
CiteETL.php…
class CiteETL {

private   $option_property_map = array(
 'E' =>   'extractor',
 'T' =>   'transformer',
 'L' =>   'loader'
);

// Not shown: identically-named accessors for these properties
private $extractor;
private $transformer;
private $loader;
…CiteETL.php…
function __construct($params) {
    // The loading process is almost always the same...
    if (!array_key_exists('L', $params)) {
        $params['L'] = 'Loader';
    }

    foreach ($params as $option => $class) {
        if (!preg_match('/^(E|T|L)$/', $option)) {
            continue;
        }
        // Naming-convention-based, factory-ish, dynamic
        // loading of classes, e.g. CiteETL/E/NYTHealth.php:
        require_once 'CiteETL/' . $option . '/' . $class . '.php';
        $instantiable_class = 'CiteETL_' . $option . '_' . $class;
        $property = $this->option_property_map[$option];
        $this->$property = new $instantiable_class;
    }
}
…CiteETL.php
function run() {
    // Extractors must all implement the Iterator interface.
    $extractor = $this->extractor();
    $extractor->rewind();
    while ($extractor->valid()) {
        $original_citation = $extractor->current();
        try {
            $transformed_citation = $this->transformer->transform(
                $original_citation
            );
        } catch (Exception $e) {
            fwrite(STDERR, $e->getMessage() . "\n");
            // Skip the load step for citations that failed to transform.
            $extractor->next();
            continue;
        }
        try {
            $this->loader->load( $transformed_citation );
        } catch (Exception $e) {
            fwrite(STDERR, $e->getMessage() . "\n");
        }
        $extractor->next();
    }
}

} // end class CiteETL
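Because every extractor implements Iterator, the same extract, transform, load loop can also be written with `foreach`. The following is a minimal, self-contained sketch: `SketchTransformer`, `SketchLoader`, and `sketch_run` are hypothetical stand-ins, and the built-in `ArrayIterator` plays the role of an extractor.

```php
<?php
// Hypothetical transformer: rejects empty items, otherwise builds a
// small citation array.
class SketchTransformer {
    function transform($item) {
        if ($item === '') {
            throw new Exception('empty item rejected');
        }
        return array('title' => strtoupper($item));
    }
}

// Hypothetical loader: collects citations in memory instead of
// submitting them to Drupal.
class SketchLoader {
    public $loaded = array();

    function load(array $citation) {
        $this->loaded[] = $citation;
    }
}

function sketch_run(Iterator $extractor, $transformer, $loader) {
    $errors = array();
    foreach ($extractor as $original) {
        try {
            $loader->load($transformer->transform($original));
        } catch (Exception $e) {
            // Record the failure and move on to the next item.
            $errors[] = $e->getMessage();
        }
    }
    return $errors;
}

$loader = new SketchLoader();
$errors = sketch_run(
    new ArrayIterator(array('first', '', 'second')),
    new SketchTransformer(),
    $loader
);
```

One failed item does not stop the run: the empty string is reported in `$errors` while the other two items reach the loader.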
Example E. Base Class…
require_once 'simplepie.inc';

class CiteETL_E_SimplePie implements Iterator {

private $items = array();
private $valid = FALSE;

function __construct($params) {
    $feed = new SimplePie();
    $feed->set_feed_url( $params['feed_url'] );
    $feed->init();
    if ($feed->error()) {
        throw new Exception( $feed->error() );
    }
    $feed->strip_htmltags( $params['strip_html_tags'] );
    $this->items = $feed->get_items();
}

// continued on next slide…
…Example E. Base Class
// …continued from previous slide
function rewind() {
    $this->valid = (FALSE !== reset($this->items));
}

function current() {
    return current($this->items);
}

function key() {
    return key($this->items);
}

function next() {
    $this->valid = (FALSE !== next($this->items));
}

function valid() {
    return $this->valid;
}

} # end class CiteETL_E_SimplePie
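The `FALSE !== reset(...)` test works here because SimplePie items are objects. For an extractor that yields scalar values, an element that is itself FALSE would end the loop early. A sketch of an alternative, assuming a plain array source: `SketchArrayExtractor` is a hypothetical class whose `valid()` asks `key()`, which returns NULL only when the internal pointer is past the end. The `#[\ReturnTypeWillChange]` lines silence return-type deprecation notices on PHP 8.1+ and are read as ordinary comments by older versions.

```php
<?php
// Hypothetical extractor base class for plain arrays.
class SketchArrayExtractor implements Iterator {
    private $items;

    function __construct(array $items) {
        $this->items = $items;
    }

    #[\ReturnTypeWillChange]
    function rewind() { reset($this->items); }

    #[\ReturnTypeWillChange]
    function current() { return current($this->items); }

    #[\ReturnTypeWillChange]
    function key() { return key($this->items); }

    #[\ReturnTypeWillChange]
    function next() { next($this->items); }

    // key() is NULL for an empty array or an exhausted pointer, so a
    // FALSE element no longer looks like the end of the data.
    #[\ReturnTypeWillChange]
    function valid() { return key($this->items) !== NULL; }
}

$seen = array();
foreach (new SketchArrayExtractor(array('a', FALSE, 'b')) as $item) {
    $seen[] = $item;
}
```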
Example Extractor
require_once 'CiteETL/E/SimplePie.php';

class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {

function __construct() {
    parent::__construct(array(
     'feed_url' =>
         'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
     'strip_html_tags' => array('br','span','a','img')
    ));
}

} // end class CiteETL_E_NYTHealth
Example Transformer…
class CiteETL_T_NYTHealth {

private $filter_pattern;

function __construct() {

    $simple_keywords = array(
        'abortion',
        'advance directives',
        // whole bunch of keywords omitted…
       'world health',
    );
    $this->filter_pattern =
        '/(' . join('|', $simple_keywords) . ')/i';
}

// continued on next slide…
…Example Transformer…
// …continued from previous slide

function transform( $simplepie_item ) {
    // create an array matching the cite CCK content type structure:
    $citation = array();

    $citation['title'] = $simplepie_item->get_title();
    $citation['field_abstract'][0]['value'] =
        $simplepie_item->get_content();
    $this->filter( $citation );

    // lots of transformation ops omitted…

    $categories = $simplepie_item->get_categories();
    $category_labels = array();
    foreach ($categories as $category) {
        array_push($category_labels, $category->get_label());
    }
    $citation['field_subject'][0]['value'] =
        join('; ', $category_labels);

    $this->filter( $citation );
    return $citation;
}
…Example Transformer
// …continued from previous slide

function filter( $citation ) {

    $combined_content =
        $citation['title'] .
        $citation['field_abstract'][0]['value'] .
        $citation['field_subject'][0]['value'];

    if (!preg_match($this->filter_pattern, $combined_content))
    {
        throw new Exception(
            "The article '" . $citation['title'] . "', id: "
            . $citation['source_id']
            . " was rejected by the relevancy filter"
        );
    }
}

} // end class CiteETL_T_NYTHealth
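The keyword filter boils down to one case-insensitive alternation. The sketch below shows the idea in isolation and adds two precautions that are assumptions, not part of the original transformer: each keyword is passed through `preg_quote`, in case a keyword ever contains a regex metacharacter, and the fields are joined with a space so that a keyword cannot match across a field boundary. The function names and the flat `title`/`abstract` keys are hypothetical.

```php
<?php
// Hypothetical helper: builds one case-insensitive alternation from
// a keyword list, escaping regex metacharacters in each keyword.
function sketch_filter_pattern(array $keywords) {
    $quoted = array();
    foreach ($keywords as $keyword) {
        $quoted[] = preg_quote($keyword, '/');
    }
    return '/(' . join('|', $quoted) . ')/i';
}

// Hypothetical helper: TRUE when any keyword occurs in the title or
// the abstract of a citation.
function sketch_is_relevant($pattern, array $citation) {
    $combined = $citation['title'] . ' ' . $citation['abstract'];
    return preg_match($pattern, $combined) === 1;
}

$pattern = sketch_filter_pattern(
    array('abortion', 'advance directives', 'world health')
);
$keep = sketch_is_relevant($pattern, array(
    'title'    => 'World Health Report Released',
    'abstract' => '',
));
$drop = sketch_is_relevant($pattern, array(
    'title'    => 'Local Sports Roundup',
    'abstract' => 'Scores from the weekend.',
));
```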
Why not FeedAPI?
• Supports only simple one-feed-field to one-CCK-field mappings
• Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
Loader
class CiteETL_L_Loader {

function load( $citation ) {
    // de-duplication code omitted…

    $node = array('type' => 'cite');
    $citation['status'] = 1;
    $node_path = drupal_execute(
     'cite_node_form', $citation, $node
    );
    $errors = form_get_errors();
    if (count($errors)) {
        $message = join('; ', $errors);
        throw new Exception( $message );
    }
    // de-duplication code omitted…
}

} // end class CiteETL_L_Loader
CCK Auto-loading Resources
• Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports
• Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update
• What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
CCK Auto-loading Problems
• Column names may change from one database instance to another if other CCK content types with identical field names already exist.
• drupal_execute bug in Drupal 5 Form API:
   • cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug
   • Fixed in Drupal versions > 5
Questions?

Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Auto-loading of Drupal CCK Nodes

  • 1. Automatic Scheduled Loading of CCK Nodes: ETL with drupal_execute, OO, drush, & cron. David Naughton | December 3, 2008
  • 2. Who am I? David Naughton
      • Web Applications Developer
          • University of Minnesota Libraries
          • naughton@umn.edu
      • 11+ years development experience
      • New to Drupal & PHP
  • 3. What's EthicShare? ethicshare.org
      • Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE
      • What: A sustainable aggregation of bioethics research and a forum for scholarship
      • When: Pilot Phase January 2008 – June 2009
      • How: Funded by Andrew W. Mellon Foundation
  • 4. Sustainable Aggregation of Bioethics Research
      • My part of the project
      • Extract citations from multiple sources
      • Transform into Drupal-compatible format
      • Load into Drupal
      • On a regular, ongoing basis
  • 5. ETL...
      • Extract, Transform, and Load = ETL
      • Very common IT problem
      • ETL is the most common term for it
      • Librarians like to say...
          • “Harvesting” instead of Extracting
          • “Crosswalking” instead of Transforming
      • ...but they're peculiar
  • 6. ...ETL
      • Complex problem
      • Lots of packaged solutions
          • Mostly Java, for data warehouses
      • Not a good fit for EthicShare
          • Using Drupal 5 and CCK
          • No Batch API
      • When we move to Drupal 6...
          • Batch API http://bit.ly/BatchAPI?
          • content.crud.inc http://bit.ly/content-crud-inc?
  • 7. Without Automation
      • First PubMed load alone was > 100,000 citations
      • Without automation, I could have been doing lots of this:
  • 8. One Solution If money were no object, we could have hired lots of these:
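  • 11. Architecture (diagram; the whole pipeline runs under drush)
      • Extractors, one per source: PubMed, WorldCat, New York Times, BBC. Each emits XML.
      • Transformers, one per source: PubMed, WorldCat, New York Times, BBC. Each turns the XML into a PHP array.
      • Loader: takes the PHP array and writes to the EthicShare MySQL database.
      • CiteETL ties the extractors, transformers, and loader together.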
  • 12. drush A portmanteau of “Drupal shell”. “…a command line shell and Unix scripting interface for Drupal, a veritable Swiss Army knife designed to make life easier for those of us who spend most of our working hours hacking away at the command prompt.” -- http://drupal.org/project/drush
  • 13. Why drush?
      • Very flexible scheduling via cron
      • Uses php-cli, so no web timeouts
      • Experimental support for running drush without a running Drupal web instance
      • Run tests from the cli with Drush simpletest runner
  • 14. Why not hook_cron?
      • If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work
      • Subject to web timeouts
      • Runs within a Drupal web instance, so large loads may affect user experience
  • 15. drush help
      $ cd $drush_dir
      $ ./drush.php help
      Usage: drush.php [options] <command> <command> ...
      Options:
        -r <path>, --root=<path>   Drupal root directory to use (default: current directory)
        -l <uri> , --uri=<uri>     URI of the drupal site to use (only needed in multisite environments)
        ...
      Commands:
        cite load    Load data to create new citations.
        help         View help. Run "drush help [command]" to view command-specific help.
        pm install   Install one or more modules
  • 16. drush command help
      $ ./drush.php help cite load
      Usage: drush.php cite load [options]
      Options:
        --E=<extractor class>          Base name of an extractor class, excluding the CiteETL/E/ parent path & '.php'. Required.
        --T=<transformer class>        Base name of a transformer class, excluding the CiteETL/T/ parent path & '.php'. Required.
        --L=<loader class>             Base name of a loader class, excluding the CiteETL/L/ parent path & '.php'. Optional: default is 'Loader'.
        --dbuser=<db username>         Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
        --dbpass=<db password>         Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
        --memory_limit=<memory limit>  Optional: default is 512M.
  • 17. drush cite load
      Example specifying the New York Times – Health extractor & transformer classes on the cli:
      $ ./drush.php cite load --E=NYTHealth --T=NYTHealth --dbuser=$dbuser --dbpass=$dbpass
      Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
  • 18. php-cli Problems
      • PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: Memory Leaks With Objects in PHP 5 http://bit.ly/php5-memory-leak
      • Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
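The circular-reference problem on this slide can be illustrated with a small sketch. Everything named here (`XmlNodeSketch`, `build_tree_sketch`) is hypothetical and not part of CiteETL; it only assumes a parsed record is held as a tree whose nodes point at both their parent and their children. On PHP 5.3 or later, `gc_collect_cycles()` can reclaim such cycles; on older versions a common workaround is to break the links by hand before dropping the last reference.

```php
<?php
// Hypothetical tree node, standing in for one parsed XML element.
class XmlNodeSketch {
  public $parent = NULL;
  public $children = array();

  function addChild(XmlNodeSketch $child) {
    $child->parent = $this;      // child -> parent
    $this->children[] = $child;  // parent -> child: a reference cycle
  }

  // Break the cycle explicitly so plain reference counting can free the tree.
  function destroy() {
    foreach ($this->children as $child) {
      $child->destroy();
    }
    $this->children = array();
    $this->parent = NULL;
  }
}

// Hypothetical helper: a root node with $width children.
function build_tree_sketch($width) {
  $root = new XmlNodeSketch();
  for ($i = 0; $i < $width; $i++) {
    $root->addChild(new XmlNodeSketch());
  }
  return $root;
}

// Workaround for PHP < 5.3: destroy each record's tree before moving on.
for ($i = 0; $i < 100; $i++) {
  $tree = build_tree_sketch(10);
  // ... process one record here ...
  $tree->destroy();
  unset($tree);
}

// PHP >= 5.3 only: the cycle collector can reclaim cycles left unbroken.
$leaky = build_tree_sketch(10);
unset($leaky);
$collected = gc_collect_cycles();
```

Calling `destroy()` per record keeps memory flat in a long-running load regardless of PHP version, at the cost of having to remember to call it.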
  • 19. drush API
      Undocumented, but simple, and http://drupal.org/project/drush links to some modules that use it.
      To create a drush command…
      • Implement hook_drush_command, mapping cli text to a callback function name
      • Implement the callback function
      …and optionally…
      • Implement a hook_help case for your command
  • 20. drush getopt emulation…
      Supports:
      • --opt=value
      • -opt or --opt (boolean based on presence or absence)
      Contrary to README.txt, does not support:
      • -opt value
      • -opt=value
  • 21. …drush getopt emulation
      • Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options']
      • Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands']
      Quirks:
      • in cases of repetition (e.g. -opt --opt=value ), last one wins
      • commands & options can be interspersed, as long as order of commands is maintained
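The parsing rules on the two slides above can be sketched as a standalone function. This is not drush's own code: `parse_args_sketch` is a hypothetical name, and the sketch only mirrors the behaviour the slides describe. How drush itself treats the unsupported `-opt=value` form is not stated, so the sketch simply ignores it.

```php
<?php
// Hypothetical re-implementation of the behaviour described above;
// this is not drush's own parsing code.
function parse_args_sketch(array $words) {
  $args = array('options' => array(), 'commands' => array());
  foreach ($words as $word) {
    if (preg_match('/^--([^=]+)=(.*)$/', $word, $matches)) {
      // --opt=value: store the value under the option name.
      $args['options'][$matches[1]] = $matches[2];
    }
    elseif (preg_match('/^--?([^=]+)$/', $word, $matches)) {
      // -opt or --opt: boolean, TRUE when present.
      $args['options'][$matches[1]] = TRUE;
    }
    elseif (substr($word, 0, 1) === '-') {
      // Unsupported form such as -opt=value: ignored (an assumption).
      continue;
    }
    else {
      // Any word not starting with a dash is a command, kept in order.
      $args['commands'][] = $word;
    }
  }
  return $args;
}

// Repeated option and interspersed commands, as in the quirks listed above.
$args = parse_args_sketch(array(
  'cite', '-E', '--E=NYTHealth', 'load', '--T=NYTHealth',
));
```

Because later assignments to the same array key overwrite earlier ones, `--E=NYTHealth` replaces the boolean set by `-E`, and the commands come out as `cite`, `load` in their original order.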
  • 22. cite.module example…
      function cite_drush_command() {
        $items['cite load'] = array(
          'callback' => 'cite_load_cmd',
          'description' => t('Load data to create new citations.')
        );
        return $items;
      }
  • 23. …cite.module example…
      function cite_load_cmd($url) {
        global $args;
        $options = $args['options'];

        // Batch loading will often require more
        // than the default memory.
        $memory_limit = (
          array_key_exists('memory_limit', $options)
          ? $options['memory_limit']
          : '512M'
        );
        ini_set('memory_limit', $memory_limit);
        // continued on next slide…
  • 24. …cite.module example
        // …continued from previous slide
        if (array_key_exists('dbuser', $options)
            && array_key_exists('dbpass', $options)) {
          user_authenticate($options['dbuser'], $options['dbpass']);
        }

        set_include_path(
          './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
          . './' . drupal_get_path('module', 'cite') . '/contrib' . PATH_SEPARATOR
          . get_include_path()
        );
        require_once 'CiteETL.php';

        $etl = new CiteETL( $options );
        $etl->run();
      } // end function cite_load_cmd
  • 25. CiteETL.php…
      class CiteETL {
        private $option_property_map = array(
          'E' => 'extractor',
          'T' => 'transformer',
          'L' => 'loader'
        );

        // Not shown: identically-named accessors for these properties
        private $extractor;
        private $transformer;
        private $loader;
  • 26. …CiteETL.php…
        function __construct($params) {
          // The loading process is almost always the same...
          if (!array_key_exists('L', $params)) {
            $params['L'] = 'Loader';
          }
          foreach ($params as $option => $class) {
            if (!preg_match('/^(E|T|L)$/', $option)) {
              continue;
            }
            // Naming-convention-based, factory-ish, dynamic
            // loading of classes, e.g. CiteETL/E/NYTHealth.php:
            require_once 'CiteETL/' . $option . '/' . $class . '.php';
            $instantiable_class = 'CiteETL_' . $option . '_' . $class;
            $property = $this->option_property_map[$option];
            $this->$property = new $instantiable_class;
          }
        }
  • 27. …CiteETL.php
        function run() {
          // Extractors must all implement the Iterator interface.
          $extractor = $this->extractor();
          $extractor->rewind();
          while ($extractor->valid()) {
            $original_citation = $extractor->current();
            try {
              $transformed_citation = $this->transformer->transform( $original_citation );
            } catch (Exception $e) {
              fwrite(STDERR, $e->getMessage() . "\n");
              // Skip this item: without the continue, a failed transform
              // would fall through to load() and advance the iterator twice.
              $extractor->next();
              continue;
            }
            try {
              $this->loader->load( $transformed_citation );
            } catch (Exception $e) {
              fwrite(STDERR, $e->getMessage() . "\n");
            }
            $extractor->next();
          }
        }
      } // end class CiteETL
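The control flow of the extract-transform-load loop above can be tried outside Drupal with stub classes. This is a minimal sketch under stated assumptions: `StubTransformerSketch`, `StubLoaderSketch`, and `run_etl_sketch` are hypothetical names, PHP's built-in `ArrayIterator` stands in for an extractor because it implements `Iterator`, and errors are collected in an array rather than written to STDERR.

```php
<?php
// Hypothetical stand-in for a CiteETL transformer: rejects empty input.
class StubTransformerSketch {
  function transform($original) {
    if (trim($original) === '') {
      throw new Exception('Empty citation rejected');
    }
    return array('title' => strtoupper($original));
  }
}

// Hypothetical stand-in for the loader: records titles instead of
// creating Drupal nodes.
class StubLoaderSketch {
  public $loaded = array();

  function load(array $citation) {
    $this->loaded[] = $citation['title'];
  }
}

// One failed item (in transform or load) is recorded and skipped;
// the loop then moves on to the next item.
function run_etl_sketch(Iterator $extractor, $transformer, $loader) {
  $errors = array();
  foreach ($extractor as $original) {
    try {
      $loader->load($transformer->transform($original));
    }
    catch (Exception $e) {
      $errors[] = $e->getMessage();
    }
  }
  return $errors;
}

$loader = new StubLoaderSketch();
$errors = run_etl_sketch(
  new ArrayIterator(array('first', '', 'third')),
  new StubTransformerSketch(),
  $loader
);
```

Using `foreach` over the iterator replaces the explicit `rewind()`, `valid()`, `current()`, `next()` calls, which removes the chance of advancing the iterator twice on the error path.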
  • 28. Example E. Base Class…
      require_once 'simplepie.inc';

      class CiteETL_E_SimplePie implements Iterator {
        private $items = array();
        private $valid = FALSE;

        function __construct($params) {
          $feed = new SimplePie();
          $feed->set_feed_url( $params['feed_url'] );
          $feed->init();
          if ($feed->error()) {
            throw new Exception( $feed->error() );
          }
          $feed->strip_htmltags( $params['strip_html_tags'] );
          $this->items = $feed->get_items();
        }
        // continued on next slide…
  • 29. …Example E. Base Class
        // …continued from previous slide
        function rewind() {
          $this->valid = (FALSE !== reset($this->items));
        }
        function current() {
          return current($this->items);
        }
        function key() {
          return key($this->items);
        }
        function next() {
          $this->valid = (FALSE !== next($this->items));
        }
        function valid() {
          return $this->valid;
        }
      } # end class CiteETL_E_SimplePie
  • 30. Example Extractor
      require_once 'CiteETL/E/SimplePie.php';

      class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {
        function __construct() {
          parent::__construct(array(
            'feed_url' => 'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
            'strip_html_tags' => array('br','span','a','img')
          ));
        }
      } // end class CiteETL_E_NYTHealth
  • 31. Example Transformer…
      class CiteETL_T_NYTHealth {
        private $filter_pattern;

        function __construct() {
          $simple_keywords = array(
            'abortion',
            'advance directives',
            // whole bunch of keywords omitted…
            'world health',
          );
          $this->filter_pattern = '/(' . join('|', $simple_keywords) . ')/i';
        }
        // continued on next slide…
  • 32. …Example Transformer…
        // …continued from previous slide
        function transform( $simplepie_item ) {
          // create an array matching the cite CCK content type structure:
          $citation = array();
          $citation['title'] = $simplepie_item->get_title();
          $citation['field_abstract'][0]['value'] = $simplepie_item->get_content();
          $this->filter( $citation );

          // lots of transformation ops omitted…

          $categories = $simplepie_item->get_categories();
          $category_labels = array();
          foreach ($categories as $category) {
            array_push($category_labels, $category->get_label());
          }
          $citation['field_subject'][0]['value'] = join('; ', $category_labels);
          $this->filter( $citation );
          return $citation;
        }
  • 33. …Example Transformer
        // …continued from previous slide
        function filter( $citation ) {
          $combined_content =
            $citation['title']
            . $citation['field_abstract'][0]['value']
            . $citation['field_subject'][0]['value'];
          if (!preg_match($this->filter_pattern, $combined_content)) {
            throw new Exception(
              "The article '" . $citation['title'] . "', id: "
              . $citation['source_id'] . " was rejected by the relevancy filter"
            );
          }
        }
      } // end class CiteETL_T_NYTHealth
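The keyword-based relevancy filter above can be sketched on its own, outside the transformer class. The function names and the `abstract` key are hypothetical, and the call to `preg_quote()` is an addition that the slide's code does not make; it is there only so that a keyword containing regex metacharacters cannot change the pattern's meaning.

```php
<?php
// Hypothetical helper: build one case-insensitive pattern from keywords.
function build_filter_pattern_sketch(array $keywords) {
  $escaped = array();
  foreach ($keywords as $keyword) {
    // preg_quote() is an addition in this sketch, not in the slide's code.
    $escaped[] = preg_quote($keyword, '/');
  }
  return '/(' . implode('|', $escaped) . ')/i';
}

// Hypothetical helper: TRUE when any keyword occurs in the title or abstract.
function is_relevant_sketch($pattern, array $citation) {
  $combined = $citation['title'] . ' ' . $citation['abstract'];
  return preg_match($pattern, $combined) === 1;
}

$pattern = build_filter_pattern_sketch(
  array('abortion', 'advance directives', 'world health')
);
$keep = is_relevant_sketch($pattern, array(
  'title' => 'World Health Report',
  'abstract' => '',
));
$drop = is_relevant_sketch($pattern, array(
  'title' => 'Stock markets rally',
  'abstract' => 'Finance news.',
));
```

Returning a boolean instead of throwing leaves the accept-or-reject decision to the caller; the slide's version throws so that the ETL loop can log the rejection and move on.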
  • 34. Why not FeedAPI?
      • Supports only simple one-feed-field to one-CCK-field mappings
      • Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
  • 35. Loader
      class CiteETL_L_Loader {
        function load( $citation ) {
          // de-duplication code omitted…
          $node = array('type' => 'cite');
          $citation['status'] = 1;
          $node_path = drupal_execute(
            'cite_node_form', $citation, $node
          );
          $errors = form_get_errors();
          if (count($errors)) {
            $message = join('; ', $errors);
            throw new Exception( $message );
          }
          // de-duplication code omitted…
        }
      } // end class CiteETL_L_Loader
  • 36. CCK Auto-loading Resources
      • Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports
      • Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update
      • What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
  • 37. CCK Auto-loading Problems
      • Column names may change from one database instance to another if other CCK content types with identical field names already exist.
      • drupal_execute bug in Drupal 5 Form API:
          • cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug
          • Fixed in Drupal versions > 5