Diachronic clustering tools for analysing the evolution of scientific production

Date: 01 July, 2016, 14:30, Room LORIA B-011

Speaker: Nicolas Dugué

Abstract: Within the ISTEX-R project, our mission is to facilitate the tracking of the evolution of scientific production through the study of the ISTEX publication base. In this context, we have implemented a diachronic clustering solution that makes it possible to follow research topics over time: merging, splitting, emergence, disappearance. We will first detail the cluster quality-measurement and labelling tools required by our approach. We will then present preliminary results on an ISTEX corpus. Finally, we will describe a visualisation platform dedicated to exploring these results.


Generating Stories from Different Event Orders: A Statistical Approach

Date: 02 May 2016, 16:00, Room B011

Speaker: Anastasia Shimorina

Abstract: This research presents a strategy for finding statistically significant language patterns and making use of them to generate new texts. Specifically, it explores the temporal relations in narrative. To investigate narrative temporal structure, a specially designed corpus of stories is used. For each story, the main events and their chronological and discourse-level orders are known. This corpus allows us to identify common temporal models for specific orders of events at the discourse level. The Conditional Random Fields method is applied to predict the best temporal model for each event order. The acquired temporal models are used in a template-based natural language generation system which outputs stories. The stories generated by the system are evaluated by human subjects. We demonstrate that stories generated according to the acquired temporal models are adequately interpreted by humans.


NVIDIA Grant

NVIDIA has granted our team an NVIDIA Titan X GPU card, which will be extremely useful to speed up our deep learning experiments on NLP. We have installed the Titan X card and it is now up and running.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of this Titan X GPU to be used in our research!


Cross-lingual transfer of syntactic dependencies through partial training of a transition-based parser

Date: 18 April, 2016, 11:00, Room LORIA B013

Speaker: Ophélie Lacroix

Abstract: In NLP, supervised learning methods are widely used because of their effectiveness, but they require access to large sets of correctly annotated data. Such data are nevertheless not available for all languages. Cross-lingual information transfer is one solution for building analysis tools for under-resourced languages, by relying on the information available in one or more well-resourced source languages. We are particularly interested in the transfer of dependency annotations using parallel corpora. The annotations available in the source language are projected onto the target-language data through alignment links. We choose to limit the projection to the most reliable dependencies, producing partially annotated target data. We then show that it is possible to train a transition-based parser from partially annotated data thanks to the use of a dynamic oracle. This simple transfer method achieves performance that rivals recent state-of-the-art methods, at a lower algorithmic cost.


Research and Challenges in Natural Language Processing at the GPLSI group

Date: 14 April, 2016, 10:30, Room B011

Speaker: Elena Lloret (University of Alicante, Spain)

Title: Research and Challenges in Natural Language Processing at the GPLSI group

Abstract: In this talk, I will introduce the research carried out by the GPLSI Research Group of the University of Alicante (Spain). I will first provide brief introductory information about the group. Then, I will summarise the main research areas addressed, as well as the most recent projects and applications developed. Finally, I will focus on Text Summarization and Natural Language Generation, the research fields in which I am most interested. I will outline the work in progress, together with the challenges that need to be faced.


Paraphrase Generation from Latent-Variable PCFGs for Semantic Parsing

Date: 21 January, 2016, 14:00, Room C005

Speaker: Shashi Narayan (U. Edinburgh, UK)

Title: Paraphrase Generation from Latent-Variable PCFGs for Semantic Parsing

Abstract: One of the limitations of semantic parsing approaches to open-domain question answering is the lexicosyntactic gap between natural language questions and knowledge base entries -- there are many ways to ask a question, all with the same answer. In this paper we propose to bridge this gap by generating paraphrases of the input question with the goal that at least one of them will be correctly mapped to a correct knowledge-base query. We introduce a novel grammar model for paraphrase generation that does not require any sentence-aligned paraphrase corpus. Our key idea is to leverage the flexibility and scalability of latent-variable probabilistic context-free grammars to sample paraphrases. We do an extrinsic evaluation of our paraphrases by plugging them into a semantic parser for Freebase. Our evaluation experiments on the WebQuestions benchmark dataset show that the performance of the semantic parser significantly improves over strong baselines.

Bio: Shashi Narayan is a research associate at the School of Informatics at the University of Edinburgh. He is currently working with Shay Cohen on spectral methods for parsing and generation. Previously, he earned his doctoral degree in 2014 from the Université de Lorraine, under the supervision of Claire Gardent. He received an Erasmus Mundus Masters scholarship (2009-2011) in Language and Communication Technology (EM-LCT). He completed his Bachelor of Technology (Honors, 2005-2009) in Computer Science and Engineering at the Indian Institute of Technology (IIT) Kharagpur, India.

He is interested in the application of syntax and semantics to solve various NLP problems, in particular, natural language generation, parsing, sentence simplification, paraphrase generation and question answering.


PhD defense on Surface Realisation from Knowledge Bases

Date: 20 January 2016, 09:30, Room C005

Speaker: Bikash Gyawali

Abstract : Natural Language Generation (NLG) is the task of automatically producing natural language text to describe information present in non-linguistic data. It involves three main subtasks: (i) selecting the relevant portion of input data; (ii) determining the words that will be used to verbalise the selected data; and (iii) mapping these words into natural language text. The latter task is known as Surface Realisation (SR). In my thesis, I study the SR task in the context of input data coming from Knowledge Bases (KB). I present two novel approaches to surface realisation from knowledge bases: a supervised approach and a weakly supervised approach.

In the first, supervised, approach, I present a corpus-based method for inducing a Feature Based Lexicalized Tree Adjoining Grammar (FB-LTAG) from a parallel corpus of text and data. The resulting grammar includes a unification based semantics and can be used by an existing surface realiser to generate sentences from test data. I show that the induced grammar is compact and generalises well over the test data yielding results that are close to those produced by a handcrafted symbolic approach and which outperform an alternative statistical approach.

In the weakly supervised approach, I explore a method for surface realisation from KB data which uses a supplied lexicon but does not require a parallel corpus. Instead, I build a corpus from heterogeneous sources of domain-related text and use it to identify possible lexicalisations of KB symbols (classes and relations) and their verbalisation patterns (frames). Based on the observations made, I build different probabilistic models which are used for selection of appropriate frames and syntax/semantics linking while verbalising KB inputs. I evaluate the output sentences and analyse the issues relevant to learning from non-parallel corpora.

In both these approaches, I use data derived from an existing biomedical ontology as a reference input. The proposed methods are generic and can be easily adapted to input from other ontologies for which a parallel/non-parallel corpus exists.

Keywords: Surface Realisation, Knowledge Bases, Grammar based Surface Realisation, Grammar Learning, Syntax/Semantics linking, Corpus based approaches.


Deep learning working group at LORIA

Deep learning is a research topic that attracts interest from many people at LORIA, across departments. But it may also be difficult to get started with, and once you are in, it's not the end of the story: there are many new models, approaches, and challenges to tackle: which architecture to choose for which task, how to tune training, what the best practices are, how to generate/get more data, how to speed up training (GPUs, HPC, clusters)...

We believe a working group at LORIA on deep learning would benefit all of us, first by letting us get to know each other, but also by letting us share experience, pieces of code, or exciting novel models.

If you're interested, you can freely register/unregister to the mailing-list by sending me an email (cerisara@loria.fr) or (better) on the sympa.inria.fr site.

General and interesting fast-to-read posts about deep learning:


a "biological view" of SGD: http://www.kdnuggets.com/2015/06/why-does-deep-learning-work.html

"Perhaps, if our algorithms can learn only with gigantic datasets what should be intrinsically learnable with hundreds, we have succumbed to laziness" http://www.kdnuggets.com/2015/10/deep-learning-vapnik-einstein-devil-yandex-conference.html

Annotated recent papers by Hugo Larochelle: https://twitter.com/hugo_larochelle/timelines/639067398511968256


Deploying applications on spark cluster

Spark is a cluster computing framework from the Hadoop ecosystem that makes it very easy to deploy applications onto several machines. The deployment itself is very simple:

  • Copy the Spark directory to every machine, preferably into the same directory on all of them, because it makes configuration simpler

  • On every machine, add to your ~/.bashrc the line:

    export SPARK_HOME=/path/to/spark/home

  • Choose machine X to be the master: on every machine, add in the file $SPARK_HOME/conf/spark-env.sh the lines:

    #!/usr/bin/env bash

    SPARK_MASTER_IP=..... # put here the IP of the master

  • Further create on every machine the file $SPARK_HOME/conf/slaves with the IPs of all slaves (one line per worker node)

  • Ensure that you can ssh from the master to every worker node with ssh keys (without password)

  • Launch the cluster on the master node:

    cd $SPARK_HOME
    ./sbin/start-all.sh


  • Check the status and jobs on the cluster in the master's web UI, served by default at http://xxxx:8080 (the master's address)

  • You can also find on this page a URI of the form spark://xxxx:7077 that can be used to submit jobs onto the cluster
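
Put together, and assuming a hypothetical two-worker setup with the master at 192.168.0.1 (all addresses and paths here are placeholders, not values from a real cluster), the steps above boil down to roughly:

```shell
# On every machine (same install path everywhere):
echo 'export SPARK_HOME=/path/to/spark/home' >> ~/.bashrc

# $SPARK_HOME/conf/spark-env.sh on every machine:
#   #!/usr/bin/env bash
#   SPARK_MASTER_IP=192.168.0.1    # hypothetical master address

# $SPARK_HOME/conf/slaves on every machine (one worker per line):
#   192.168.0.2
#   192.168.0.3

# On the master only, once passwordless ssh to every worker is set up:
cd $SPARK_HOME && ./sbin/start-all.sh
```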

Now that you have a running cluster, the way I prefer to submit jobs onto this cluster is to modify my Java programs so that they automatically send jobs to the cluster. This involves adding the Spark jar to the classpath, and:

  • Create a JavaSparkContext configured with the link (spark://xxxx:7077) to the cluster
  • Put all of your data into a Java list and call JavaSparkContext.parallelize() on this list so that Spark distributes the data onto the cluster nodes
  • Call on this RDD at least a .map() operation with the function you want to perform on each data chunk on the workers, followed by a .reduce() operation with a function that recursively merges the results of two data chunks. See some Spark tutorials for further details.
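
As an illustrative sketch of these three steps (the master URI is a placeholder, the class and data are made up for the example, and the Spark jar must be on the classpath):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkJobSketch {
    public static void main(String[] args) {
        // 1. Point the driver at the standalone cluster master
        SparkConf conf = new SparkConf()
                .setAppName("SparkJobSketch")
                .setMaster("spark://xxxx:7077"); // placeholder master URI
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 2. Put the data in a Java list and let Spark distribute it
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> rdd = sc.parallelize(data);

        // 3. map() runs on each data chunk on the workers;
        //    reduce() recursively merges the partial results
        int sumOfSquares = rdd.map(x -> x * x)
                              .reduce((a, b) -> a + b);

        System.out.println("Sum of squares: " + sumOfSquares);
        sc.close();
    }
}
```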


There are many, many potential causes of failure. Here are only a few of them; please share if you find new ones:

  • Don't put shared variables (broadcasts and accumulators) in class members, but rather pass them as method parameters; otherwise, they'll be null on distant workers.
  • Don't use multiple jars when launching on a real cluster; rather, build a "fat jar" that contains all dependencies. Building this fat jar may be tricky, because some files may overwrite other files with the same name, especially configuration files. Such files may sometimes be handled by appending them into a single large configuration file.
  • Don't create anonymous Spark functions; rather, create independent classes that implement these functions, because anonymous functions trigger the serialization of their containing class, which may create problems because the SparkContext, for instance, is not serializable.
  • Accumulators have to be read (accu.value()) only once, after all workers have finished; after that, they cannot be updated any more. So if you want intermediate results, split your corpus into chunks, and send each chunk as an independent Spark task, recreating the accumulators for every chunk.
  • It might happen that not all of your cores are used at full speed, because partitions have been incorrectly computed by Spark. In that case, use the repartition(n) function on your RDD with a large enough n, e.g. 2 or 3 times the total number of your cores.
  • You must submit your job with the spark-submit script, which computes the required classpath
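
To illustrate the point about anonymous functions, a minimal sketch of a standalone function class (the class name is hypothetical):

```java
import org.apache.spark.api.java.function.Function;

// A standalone, top-level class: serializing it does not drag in
// the driver class, which may hold a non-serializable SparkContext.
public class SquareFunction implements Function<Integer, Integer> {
    @Override
    public Integer call(Integer x) {
        return x * x;
    }
}
```

It can then be used as rdd.map(new SquareFunction()) instead of an anonymous inner class or lambda defined inside the driver class.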


Contents © 2016 Christophe Cerisara - Powered by Nikola