TreeTagger2NITE NXT with positional tags

Converts TreeTagger output to NITE XML format and includes positional information (nominative or oblique). Download here.

Posted in Linguistics & SLA | Comments Off

Download Perl script to create instances from TreeTagger format files

TT2seq_feat_header_3grams_context.pl is a Perl program that you can amend at will.

Posted in NLP, PERL scripts | Comments Off

Create your matrix of features from texts

In the contributions section, Perl script 1 extracts text units and displays them in a matrix-like format so that the files can be imported as data frames in R, for example. They can also be used with classifiers such as TiMBL.
The texts must already be tokenized and tagged if you want to use the script as is. The script targets the forms it, this and that. Feel free to modify it for your own use.
Here is what the output looks like in my case:
DIDID TOKENS TAGS TOKENS3BEFORE TAGS3BEFORE TOKENS2BEFORE TAGS2BEFORE TOKENS1BEFORE TAGS1BEFORE TOKENS1AFTER TAGS1AFTER TOKENS2AFTER TAGS2AFTER TOKENS3AFTER TAGS3AFTER CONTEXT DISCOURSE
DID0014-S001.seq it PNR me PRP something NN about IN SYM SYM okay JJ 0 0
DID0014-S001.seq that TCOM also RB the DT fact NN peoples NNS are VBP drinking VBG 0 0


The first line contains the column headers. If you use the file with TiMBL, this line needs to be deleted.
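For readers who do not use Perl, here is a minimal Python sketch of the same idea. It is not the Perl script itself. It assumes the input is in the usual TreeTagger layout (token, tag and optionally lemma, separated by tabs, one token per line). The function names and the `_` padding symbol are hypothetical, and the CONTEXT and DISCOURSE columns are left out.

```python
TARGETS = {"it", "this", "that"}   # forms to extract, as in the post
WINDOW = 3                          # tokens of context on each side
PAD = ("_", "_")                    # hypothetical filler at text boundaries


def parse_treetagger(text):
    """Return (token, tag) pairs from token<TAB>tag[<TAB>lemma] lines."""
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def header():
    """Column names, mirroring the header shown in the sample output."""
    cols = ["DIDID", "TOKENS", "TAGS"]
    for k in range(WINDOW, 0, -1):
        cols += [f"TOKENS{k}BEFORE", f"TAGS{k}BEFORE"]
    for k in range(1, WINDOW + 1):
        cols += [f"TOKENS{k}AFTER", f"TAGS{k}AFTER"]
    return cols


def instances(pairs, doc_id):
    """Yield one row per target form with its left and right context."""
    padded = [PAD] * WINDOW + pairs + [PAD] * WINDOW
    for i, (token, tag) in enumerate(pairs):
        if token.lower() not in TARGETS:
            continue
        j = i + WINDOW  # position of the target in the padded list
        row = [doc_id, token, tag]
        for tok, tg in padded[j - WINDOW:j] + padded[j + 1:j + 1 + WINDOW]:
            row += [tok, tg]
        yield row


def to_matrix(text, doc_id, with_header=True):
    """Build the space-separated matrix; drop the header for TiMBL."""
    lines = [" ".join(header())] if with_header else []
    lines += [" ".join(row) for row in instances(parse_treetagger(text), doc_id)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = (
        "also\tRB\talso\nthe\tDT\tthe\nfact\tNN\tfact\nthat\tTCOM\tthat\n"
        "peoples\tNNS\tpeople\nare\tVBP\tbe\ndrinking\tVBG\tdrink\n"
    )
    print(to_matrix(sample, "DID0014-S001.seq"), end="")
```

Calling `to_matrix(text, doc_id, with_header=False)` gives a file without the header line, which is the form needed for TiMBL.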

Posted in Linguistics & SLA | Comments Off

TreeTagger .par file trained on native WSJ corpus

After modifying some of the PoS tags related to this, that and it in the WSJ corpus, I trained TreeTagger on this new subset and obtained a .par file (see the contributions page) that can be used to tag other corpora with the modified Penn tag set.

Posted in NLP | Comments Off

The acquisition of ‘this’ and ‘that’ by learners

Learners of English do not necessarily have a good command of the demonstratives. There are a variety of unexpected uses, and these can be classified according to several criteria. Semantically, learners may experience difficulties when constructing referential processes. Research on deictic and anaphoric processes provides a better understanding of their output. At the functional level, learners experience difficulties in selecting one form or the other. Still at the functional level, there are two learner-specific micro-systems of use in which the forms interact. Firstly, in their proform function, they interact with the pronoun it. Secondly, in their determiner function, they interact with the determiner the. It appears that, for learners, this and that have competitor forms, and a close investigation of their use in learner corpora would show the extent to which such issues arise.

More on this in (Gaillat, 2013a). Draft version here.

Posted in Linguistics & SLA | Comments Off

What it looks like to tag with TreeTagger

Once you have your files, this is what happens

Enjoy and let me know if it helps!

Posted in Linguistics & SLA | Comments Off

What it looks like to train TreeTagger

It takes 3 files and a few seconds ;-)

Video here

Posted in Linguistics & SLA | Comments Off

Customise PoS tags with TreeTagger

It’s possible to extend or modify the tagset used by TreeTagger. The solution involves three steps:
1. Retag a Penn Treebank compliant corpus
2. Train TreeTagger on it
3. Use the trained .par file to tag another corpus with TreeTagger

More details in this paper (Gaillat, 2013).
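The preparation for step 2 can be sketched in Python as below. This is a hedged illustration, not the procedure from the paper. It assumes the retagged corpus is available as token, tag and lemma separated by tabs. It also assumes, based on my reading of the TreeTagger documentation, that the lexicon file lists each word form followed by a tab and its tag-lemma pairs, and that `train-tree-tagger` takes the lexicon, the open-class tag file, the tagged corpus and the output .par file in that order. Please check both against the README shipped with your TreeTagger version. All function and file names are hypothetical.

```python
from collections import defaultdict


def read_tagged(text):
    """Parse token<TAB>tag<TAB>lemma lines into (token, tag, lemma) triples."""
    triples = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 3:
            triples.append(tuple(fields))
    return triples


def lexicon_lines(triples):
    """One line per word form: form<TAB>tag lemma[<TAB>tag lemma ...].

    The layout is an assumption to verify against the TreeTagger README.
    """
    entries = defaultdict(set)
    for token, tag, lemma in triples:
        entries[token].add((tag, lemma))
    lines = []
    for token in sorted(entries):
        pairs = "\t".join(f"{tag} {lemma}" for tag, lemma in sorted(entries[token]))
        lines.append(f"{token}\t{pairs}")
    return lines


def training_lines(triples):
    """Training corpus lines: token<TAB>tag, one token per line."""
    return [f"{token}\t{tag}" for token, tag, _ in triples]


def train_command(lexicon, open_class, corpus, par_file):
    """Argument list for the training call (argument order as assumed above)."""
    return ["train-tree-tagger", lexicon, open_class, corpus, par_file]


if __name__ == "__main__":
    sample = "that\tTCOM\tthat\nfact\tNN\tfact\nthat\tDT\tthat\n"
    triples = read_tagged(sample)
    print("\n".join(lexicon_lines(triples)))
    print("\n".join(training_lines(triples)))
    # Hypothetical file names; the list could be passed to subprocess.run.
    print(" ".join(train_command("lexicon.txt", "open_class.txt", "train.txt", "custom.par")))
```

The third input, the open-class tag file, is simply a list of the tags that unknown words may receive, so it is written by hand rather than derived here.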

Posted in NLP | Comments Off

The purpose of this blog

Welcome to my blog on SLA and linguistics.

The purpose of this blog is to document and share my experience in using NLP tools for the linguistic analysis of various corpora (learner and native).

Posted in Linguistics & SLA | Comments Off