Generating dynamic prose in PHP

Presenter Notes

Abstract

Tired of keeping the French and English versions of your Web site up to date? The field of natural language generation (NLG) has methods and insights that can help you.

In this talk, I will present my general NLG framework written in PHP, PHP-NLGen (https://github.com/DrDub/php-nlgen), and show an example of how to achieve multilingual NLG using it.

NLG is not for the faint of heart, but it is a fun, young field with plenty of opportunities for innovation.

About the speaker: Dr Duboue has a PhD in Computer Science from Columbia University (New York) in NLG and five years of corporate research experience (including the IBM Research Watson project).

Introduction

What is NLG

  • Natural Language Generation is the automatic construction of textual prose, starting either from text (text-to-text generation) or from tabular data (data-to-text generation).
    • printf++
    • intelligent templates
  • When to use it

    • Text Output vs. Graphs
    • Capture generalization across dimensions
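
To make the "printf++" idea concrete, here is a minimal sketch (not taken from php-nlgen) that contrasts a plain sprintf template with one that picks the verb form from the number of its subject. The function names and the two-entry verb table are hypothetical; the French verb forms are the ones used in the lexicon example later in the talk.

```php
<?php
// Plain template: the verb form is frozen into the format string,
// so a plural subject yields an agreement error.
function plain_template(string $subject): string {
    return sprintf('%s diminuera.', $subject);
}

// "Intelligent" template: the verb form is chosen from the subject's number.
// $verb_forms is a hypothetical mini-lexicon entry keyed by 'sing' / 'pl'.
function smart_template(string $subject, string $num, array $verb_forms): string {
    return sprintf('%s %s.', $subject, $verb_forms[$num]);
}

$decrease = array('sing' => 'diminuera', 'pl' => 'diminueront');

echo plain_template('Les dépenses'), "\n";                 // wrong agreement
echo smart_template('Les dépenses', 'pl', $decrease), "\n"; // correct agreement
echo smart_template("Le nombre d'accidents de voiture", 'sing', $decrease), "\n";
```

The point of the contrast is that the second template carries a small amount of linguistic knowledge (number agreement) that a format string alone cannot express.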

The Setting

  • From James McKinney (@mckinneyjames): http://budgetplateau.com/
  • From Roberto Rocha (@robroc): Budget Commentary.
  • My contribution: multilingual output using text generation.

The Example

  • Start from rules
IF culture_feries + culture_dimanche < 30 THEN
  pct_change(reading, -10)

IF culture_spectacles < 50 THEN
  on_strike(employee(maison_culture))
  • Combine predicates such as on_strike and remove redundant information

  • Transform each predicate and arguments into French or English as required

  • Les dépenses à des activités culturelles seront les plus bas dans la ville. Les dépenses dans les soins de la rue seront les plus élevé dans la ville. Les fonctionnères de la piscine interieur et les fonctionnères de la Maison de la Culture vont aller en grêve. Le nombre d'accidents de voiture diminuera de 5 pour cent.

  • The spending in cultural activities will be the lowest within the city. The spending in street care will be the highest within the city. The heated pool employees and the employees at the Maison de la Culture will go on strike. The number of car accidents will decrease by 5 percent.

Why NLG

  • We can reason about the text
    • Aggregation
      • Generate: The heated pool employees and the employees at the Maison de la Culture will go on strike.
      • Instead of: The heated pool employees will go on strike. The employees at the Maison de la Culture will go on strike.
    • Subsumption
      • Generate: Les dépenses dans les soins de la rue seront les plus élevé dans la province.
      • Instead of: Les dépenses dans les soins de la rue seront les plus élevé dans la ville. Les dépenses dans les soins de la rue seront les plus élevé dans la province.
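
Both operations work on predicates, before any text exists. The following is a rough sketch under stated assumptions, not the php-nlgen code: predicates are represented as plain arrays, actors are bare IDs rather than nested employee(...) predicates, and the identifiers 'highest' and 'street_care' are made up for the illustration. The ontology fragment mirrors the city/province frames shown later in the talk.

```php
<?php
// Assumed representation: array('pred' => name, 'args' => array(...)).

// Subsumption: drop benchmarked(pos, metric, R) when the same statement
// exists for a region whose ontology frame includes R.
function subsume_benchmarks(array $preds, array $onto): array {
    $kept = array();
    foreach ($preds as $p) {
        $subsumed = false;
        if ($p['pred'] === 'benchmarked') {
            foreach ($preds as $q) {
                if ($q['pred'] === 'benchmarked'
                    && $q['args'][0] === $p['args'][0]   // same position
                    && $q['args'][1] === $p['args'][1]   // same metric
                    && ($onto[$q['args'][2]]['includes'] ?? null) === $p['args'][2]) {
                    $subsumed = true;  // the wider region already covers it
                    break;
                }
            }
        }
        if (!$subsumed) {
            $kept[] = $p;
        }
    }
    return $kept;
}

// Aggregation: merge all on_strike predicates into one with a list of actors.
function aggregate_strikes(array $preds): array {
    $actors = array();
    $rest = array();
    foreach ($preds as $p) {
        if ($p['pred'] === 'on_strike') {
            $actors[] = $p['args'][0];
        } else {
            $rest[] = $p;
        }
    }
    if (count($actors) > 0) {
        // A single actor stays a scalar; several become a list.
        $actor = count($actors) === 1 ? $actors[0] : $actors;
        $rest[] = array('pred' => 'on_strike', 'args' => array($actor));
    }
    return $rest;
}

$onto = array(
    'city'     => array('class' => 'region'),
    'province' => array('class' => 'region', 'includes' => 'city'),
);
$preds = array(
    array('pred' => 'benchmarked', 'args' => array('highest', 'street_care', 'city')),
    array('pred' => 'benchmarked', 'args' => array('highest', 'street_care', 'province')),
    array('pred' => 'on_strike', 'args' => array('heated_pool')),
    array('pred' => 'on_strike', 'args' => array('maison_culture')),
);
$result = aggregate_strikes(subsume_benchmarks($preds, $onto));
echo count($result), " predicates left\n";
```

Four input predicates become two: one provincial benchmark and one strike with two actors, which the generator can then verbalize as a single coordinated sentence.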

  • Ease of Maintenance
    • Adding new entities (and predicates).
      • More agents that can go on strike.
      • More metrics that can be benchmarked at the city and provincial levels.
    • Avoid cut-and-paste errors.
      • Edit this: "Les dépenses à des activités culturelles seront les plus bas dans la ville."
      • Into this: "Le nombre d'accidents de voiture seront les plus bas dans la ville." (the plural verb and adjective no longer agree with the singular subject).

php-nlgen

Ideas Behind the Framework

  • Recursive Descent Generation
    • Grammar encoded into functions
  • Ontologies
    • A repository of general records
  • Lexicon
    • A repository of lexical entries

Recursive Descent Generation

  • Similar to recursive descent parsers
  • Each grammar rule corresponds to a function.
    • Arguments: data as needed (function dependent).
    • Returns: the generated string plus a dictionary with semantic info about the generated string.
  • These functions implement the grammar by calling other grammar-symbols-turned-functions.
  • The order of these calls needs to be such that constraints can be collected and transferred.
    • For example, generate the subject first, so as to know whether it is plural or singular, before generating the verbal phrase.
    • Such ordering might not exist for complex sentences (a full-fledged generation system is then needed).

  • Example
  function on_strike($data){
    // Generate the subject first, so that its semantics (e.g., number)
    // are known before the verb phrase is generated.
    $actor = $data[0];
    $actor_str = $this->gen('np', array('head'=>$actor), 'actor');
    $sem = $this->current_semantics();
    return $actor_str . ' ' .
      $this->gen('on_strike_vp', array('subject' => $sem['actor']));
  }
  function np($data){
    $head = $data['head'];
    if(is_object($head)){
      // Nested predicate, e.g., employee(maison_culture): delegate to it.
      $str = $this->gen($head->predicate, $head->args, 'subpred');
      $sem = $this->current_semantics();
      return array('text'=>$str, 'sem'=>$sem['subpred']);
    }else if(is_array($head)){
      // Coordination: verbalize each element, then join with a conjunction.
      $str = array();
      foreach($head as $item){
        $str[] = $this->gen('np', array('head'=>$item));
      }
      // The original excerpt read an undefined $gen here; the gender of the
      // coordination is left unset (null) rather than guessed.
      return array('text'=>join(", ", array_slice($str, 0, count($str)-1)) . ' ' .
        $this->lex->string_for_id('conjunction') . ' ' . $str[count($str)-1],
        'sem' => array('gen'=>null, 'num'=>'pl'));
    }else{
      // Plain lexical entry: look up its string in the lexicon.
      return array('text'=>$this->lex->string($head, $data), 'sem'=>$data);
    }
  }

Ontologies and Lexicons

  • A set of hierarchical records (called frames) indexed by unique IDs.
    • Not unlike a NoSQL database.
    • The difference is the helper functions defined for each.
      • For ontologies we care about the type of a frame.
      • For lexicons, we care about the string(s) verbalizing a concept.
Ontology
"city":{ "class":"region" },
"province":{ "class":"region", "includes":"city" }
Lexicon (English)
"maison_culture":{ "string":"Maison de la Culture", "class":"place" },
"heated_pool":{ "string":"heated pool", "is_short":"1", "class":"place" }
Lexicon (French)
"will_decrease":{ "pl":"diminueront", "sing":"diminuera", "class":"verb" }
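
As an illustration of "same storage, different helper functions", the sketch below loads the fragments above (wrapped in braces to make them valid JSON) and defines one helper of each kind. The helper names ids_by_class and lex_string are hypothetical and are not the php-nlgen API.

```php
<?php
// The frames shown above, wrapped in braces so that json_decode accepts them.
$onto = json_decode('{"city":{"class":"region"},
    "province":{"class":"region","includes":"city"}}', true);
$lex_en = json_decode('{"maison_culture":{"string":"Maison de la Culture","class":"place"},
    "heated_pool":{"string":"heated pool","is_short":"1","class":"place"}}', true);
$lex_fr = json_decode('{"will_decrease":{"pl":"diminueront","sing":"diminuera","class":"verb"}}', true);

// Ontology-style helper (hypothetical): IDs of all frames of a given class.
function ids_by_class(array $frames, string $class): array {
    $ids = array();
    foreach ($frames as $id => $frame) {
        if (($frame['class'] ?? null) === $class) {
            $ids[] = $id;
        }
    }
    return $ids;
}

// Lexicon-style helper (hypothetical): the string verbalizing a frame,
// optionally selected by a feature such as 'sing' or 'pl'.
function lex_string(array $frames, string $id, string $feature = 'string'): ?string {
    return $frames[$id][$feature] ?? null;
}

echo implode(', ', ids_by_class($onto, 'region')), "\n";
echo lex_string($lex_en, 'heated_pool'), "\n";
echo lex_string($lex_fr, 'will_decrease', 'pl'), "\n";
```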

The Framework Itself

  • Three classes:

    generator.php

    • 192 loc
    • 11 methods

    ontology.php

    • 96 loc
    • 5 methods

    lexicon.php

    • 352 loc
    • 32 methods

generator.php

  • Constructor receives an ontology and a lexicon or JSON strings for each.
    • Multilingual extension: a dictionary of language name to lexicon.
  • Each generator has to subclass Generator and implement the method top.
  • Clients call the method generate with the input data and a context.
    • Multilingual extension: context contains the language to generate.
  • The main method, gen, wraps the grammar-rules-turned-methods to track semantics.
    • Multilingual extension: when calling method 'foo', if it is not defined, the system will try with 'foo_$lang'.
    • The methods can return strings or a dictionary containing the keys 'text' and 'sem'.
    • Either way, a tree is built by successive invocations of gen, by means of a stack.
  • Also of interest: savepoint and rollback methods (for backtracking).
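
The dispatch behaviour described above can be approximated in a few lines. This is a toy written for this explanation, not the generator.php implementation: the class ToyGenerator and its grammar methods are hypothetical, and semantics are kept in a flat dictionary of named slots instead of the stack-built tree the framework uses.

```php
<?php
// A toy dispatcher, NOT the php-nlgen code. Only the names gen and top
// come from the talk; everything else is made up for illustration.
class ToyGenerator {
    private $lang;
    private $sem = array();   // semantics recorded per named slot

    public function __construct(string $lang) {
        $this->lang = $lang;
    }

    // Call grammar method $name, falling back to "{$name}_{$lang}".
    public function gen(string $name, array $data, ?string $slot = null): string {
        $method = method_exists($this, $name) ? $name : $name . '_' . $this->lang;
        if (!method_exists($this, $method)) {
            throw new BadMethodCallException("no grammar method for $name");
        }
        $result = $this->$method($data);
        // Methods may return a string or array('text' => ..., 'sem' => ...).
        if (is_array($result)) {
            if ($slot !== null) {
                $this->sem[$slot] = $result['sem'];
            }
            return $result['text'];
        }
        return $result;
    }

    public function top(array $data): string {
        // Subject first, so its number is known before the verb is chosen.
        $subject = $this->gen('subject', $data, 'subject');
        $verb = $this->gen('verb', array('num' => $this->sem['subject']['num']));
        return $subject . ' ' . $verb . '.';
    }

    protected function subject_en(array $d) {
        return array('text' => 'The number of car accidents',
                     'sem' => array('num' => 'sing'));
    }
    protected function subject_fr(array $d) {
        return array('text' => "Le nombre d'accidents de voiture",
                     'sem' => array('num' => 'sing'));
    }
    protected function verb_en(array $d) {
        return 'will decrease';
    }
    protected function verb_fr(array $d) {
        return $d['num'] === 'pl' ? 'diminueront' : 'diminuera';
    }
}

echo (new ToyGenerator('fr'))->top(array()), "\n";
echo (new ToyGenerator('en'))->top(array()), "\n";
```

The same top method serves both languages; only the language-suffixed methods differ, which is the appeal of the 'foo_$lang' fallback.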

ontology.php and lexicon.php

  • Ontology class
    • find, has, find_all_by_class, find_by_path
  • Lexicon class
    • find, has, find_all, query, string_for_id, ...
    • The lexicon is quite complex and heavily under development.

The Demo

Domain Modelling

  • Rules
IF culture_feries > 2 AND culture_dimanche == 52 THEN
  pct_change(reading, 10)

IF culture_feries + culture_dimanche < 30 THEN
  pct_change(reading, -10)

IF culture_spectacles < 50 THEN
  on_strike(employee(maison_culture))

IF deneigement_chargements == 5 AND deneigement_findesemaine < 3 THEN
  pct_change(car_accident, 10)

IF routier_nidsdepoule > 10 THEN
  pct_change(car_accident, -5)
  • Predicates
pct_change( metric, delta )
on_strike( actor )
benchmarked( position, metric, region )

employee( place )
  • Ontology -- minimal
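
One possible way to encode the rules above in PHP is as pairs of a condition and the predicate it emits. This is a sketch under assumptions: the array representation of predicates, the function names budget_rules and apply_rules as written here, and the sample budget values are all made up for illustration and may differ from the demo code.

```php
<?php
// Each rule: a condition over the budget inputs plus the predicate it emits.
function budget_rules(): array {
    return array(
        array(function ($b) { return $b['culture_feries'] > 2 && $b['culture_dimanche'] == 52; },
              array('pred' => 'pct_change', 'args' => array('reading', 10))),
        array(function ($b) { return $b['culture_feries'] + $b['culture_dimanche'] < 30; },
              array('pred' => 'pct_change', 'args' => array('reading', -10))),
        array(function ($b) { return $b['culture_spectacles'] < 50; },
              // Nested predicate: on_strike(employee(maison_culture)).
              array('pred' => 'on_strike', 'args' => array(
                  array('pred' => 'employee', 'args' => array('maison_culture'))))),
        array(function ($b) { return $b['deneigement_chargements'] == 5 && $b['deneigement_findesemaine'] < 3; },
              array('pred' => 'pct_change', 'args' => array('car_accident', 10))),
        array(function ($b) { return $b['routier_nidsdepoule'] > 10; },
              array('pred' => 'pct_change', 'args' => array('car_accident', -5))),
    );
}

// Return the predicates of every rule whose condition holds.
function apply_rules(array $budget): array {
    $preds = array();
    foreach (budget_rules() as $rule) {
        list($cond, $pred) = $rule;
        if ($cond($budget)) {
            $preds[] = $pred;
        }
    }
    return $preds;
}

// Hypothetical budget inputs, chosen to fire three of the five rules.
$budget = array(
    'culture_feries' => 5,
    'culture_dimanche' => 10,
    'culture_spectacles' => 40,
    'deneigement_chargements' => 4,
    'deneigement_findesemaine' => 2,
    'routier_nidsdepoule' => 12,
);
$preds = apply_rules($budget);
foreach ($preds as $p) {
    echo $p['pred'], "\n";
}
```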

The Generator

  • top
    • apply_rules
    • sort_predicates
    • sentence_planning
    • Each predicate is verbalized
      • pct_change -> metric
      • benchmarked -> metric
      • on_strike -> np (might call another predicate, e.g., employee), on_strike_vp

Examples

City Culture

Province Culture

Aggregation

Deltas

Conclusions
