Converts from TreeTagger format to NITE XML format and includes positional information (nominative or oblique). Download here.
TT2seq_feat_header_3grams_context.pl is a Perl program that you can amend at will.
On the contributions page, Perl script 1 extracts text units and displays them in a matrix-like format, so that the files can be imported as data frames in R, for example. They can also be used with classifiers such as TiMBL.
The texts must be tokenized and tagged beforehand if you want to use the script as is. The script targets the forms it, this and that. Feel free to modify it for your own use.
Here is what the output looks like in my case:
DIDID TOKENS TAGS TOKENS3BEFORE TAGS3BEFORE TOKENS2BEFORE TAGS2BEFORE TOKENS1BEFORE TAGS1BEFORE TOKENS1AFTER TAGS1AFTER TOKENS2AFTER TAGS2AFTER TOKENS3AFTER TAGS3AFTER CONTEXT DISCOURSE
DID0014-S001.seq it PNR me PRP something NN about IN SYM SYM okay JJ 0 0
DID0014-S001.seq that TCOM also RB the DT fact NN peoples NNS are VBP drinking VBG 0 0
…
The first line contains the column headers. It must be deleted before the file is passed to TiMBL.
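To make the layout of the output more concrete, here is a minimal Python sketch of the same idea. It is not the Perl script itself. It assumes the text is already available as a list of (token, tag) pairs, and it pads positions that fall outside the sequence with a placeholder "NA", which is my own choice. The names context_rows, header and to_lines are hypothetical, as is the column name ID, and the CONTEXT and DISCOURSE columns are left out because their content is not described here.

```python
from typing import List, Tuple

TARGETS = {"it", "this", "that"}  # the forms the post says the script targets
PAD = "NA"  # assumption: filler used when the window runs past the sequence edge


def context_rows(seq_id: str, tagged: List[Tuple[str, str]], width: int = 3) -> List[List[str]]:
    """One row per target token: id, token, tag, then the token/tag pairs
    at positions -width..-1 followed by those at +1..+width."""
    offsets = list(range(-width, 0)) + list(range(1, width + 1))
    rows = []
    for i, (token, tag) in enumerate(tagged):
        if token.lower() not in TARGETS:
            continue
        row = [seq_id, token, tag]
        for off in offsets:
            j = i + off
            if 0 <= j < len(tagged):
                row.extend(tagged[j])
            else:
                row.extend((PAD, PAD))
        rows.append(row)
    return rows


def header(width: int = 3) -> List[str]:
    """Column names in the same order as the rows built above."""
    cols = ["ID", "TOKENS", "TAGS"]
    for n in range(width, 0, -1):
        cols += [f"TOKENS{n}BEFORE", f"TAGS{n}BEFORE"]
    for n in range(1, width + 1):
        cols += [f"TOKENS{n}AFTER", f"TAGS{n}AFTER"]
    return cols


def to_lines(rows: List[List[str]], with_header: bool = True, width: int = 3) -> List[str]:
    """Space-separated lines; pass with_header=False for a TiMBL-style file."""
    lines = [" ".join(header(width))] if with_header else []
    lines += [" ".join(row) for row in rows]
    return lines


if __name__ == "__main__":
    sample = [("also", "RB"), ("the", "DT"), ("fact", "NN"), ("that", "TCOM"),
              ("peoples", "NNS"), ("are", "VBP"), ("drinking", "VBG")]
    for line in to_lines(context_rows("DID0014-S001.seq", sample)):
        print(line)
```

Keeping the header optional means the same rows can feed R (with the header) or TiMBL (without it).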
After modifying some of the PoS tags related to this, that and it in the WSJ corpus, I trained TreeTagger (TT) on this modified subset and obtained a .par file (see the contributions page) that can be used to tag other corpora with the modified Penn tag set.
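As a rough illustration of what such a retagging pass can look like, here is a small Python sketch. The rules in it are invented for the example: only the tag names PNR and TCOM are taken from the sample output above, while the original Penn tags they replace and the simple form-plus-tag lookup are my assumptions. The real modifications may well depend on context that a lookup table cannot capture.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word form, PoS tag)

# Hypothetical rules, for illustration only:
# (lowercased form, original tag) -> replacement tag.
EXAMPLE_RULES: Dict[Tuple[str, str], str] = {
    ("that", "IN"): "TCOM",
    ("it", "PRP"): "PNR",
}


def retag(tokens: List[Token], rules: Dict[Tuple[str, str], str]) -> List[Token]:
    """Return a copy of the corpus in which every (form, tag) pair that
    matches a rule carries the new tag; all other tokens are unchanged."""
    return [(word, rules.get((word.lower(), tag), tag)) for word, tag in tokens]


def training_lines(tokens: List[Token]) -> List[str]:
    """One token per line, form and tag separated by a tab."""
    return [f"{word}\t{tag}" for word, tag in tokens]


if __name__ == "__main__":
    corpus = [("the", "DT"), ("fact", "NN"), ("that", "IN"), ("it", "PRP")]
    print("\n".join(training_lines(retag(corpus, EXAMPLE_RULES))))
```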
Learners of English do not necessarily have a good command of the demonstratives. There are a variety of unexpected uses, and these can be classified according to several criteria. Semantically, learners may experience difficulties when constructing referential processes. Research on deictic and anaphoric processes provides a better understanding of their output. At the functional level, learners experience difficulties in selecting one form or the other. Still at the functional level, there are two learner-specific micro-systems of use in which the forms interact. Firstly, in their proform function, they interact with the pronoun it. Secondly, in their determiner function, they interact with the determiner the. It appears that, for learners, this and that have competitor forms, and a close investigation of their use in learner corpora would show the extent to which such issues arise.
More on this in (Gaillat, 2013a). Draft version here.
Once you have your files, this is what happens …
Enjoy and let me know if it helps!
It takes 3 files and a few seconds
It is possible to extend or modify the tagset used by TreeTagger. The method has three steps:
1. Retag a Penn Treebank compliant corpus
2. Train TreeTagger on it
3. Use the resulting .par file to tag another corpus with TreeTagger
More details in this paper (Gaillat, 2013).
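The three steps can be sketched in Python as follows. This is my reading of TreeTagger's training documentation, not the exact procedure used here: training takes a full-form lexicon, a list of open-class tags and a tagged training file, and writes a parameter file. The helper names and file names are hypothetical, the lemma is replaced by a dummy "-", and the commands are only built as argument lists, not executed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, PoS tag) from the retagged corpus


def build_lexicon(tokens: Iterable[Token]) -> Dict[str, List[str]]:
    """Map every word form to the sorted list of tags it occurs with."""
    seen = defaultdict(set)
    for word, tag in tokens:
        seen[word].add(tag)
    return {word: sorted(tags) for word, tags in seen.items()}


def lexicon_lines(lexicon: Dict[str, List[str]], lemma: str = "-") -> List[str]:
    """One line per form: the form, a tab, then 'tag lemma' pairs.
    Assumption: a dummy lemma is enough when only tags are needed."""
    return [
        word + "\t" + "\t".join(f"{tag} {lemma}" for tag in lexicon[word])
        for word in sorted(lexicon)
    ]


def open_class_line(tags: Iterable[str]) -> str:
    """Tags that unknown words may receive, separated by spaces."""
    return " ".join(tags)


def train_command(lexicon: str, open_class: str, training: str, par: str) -> List[str]:
    """Step 2: argument list for the training program (not executed here)."""
    return ["train-tree-tagger", lexicon, open_class, training, par]


def tag_command(par: str, corpus: str, output: str) -> List[str]:
    """Step 3: argument list for tagging another corpus with the new .par file."""
    return ["tree-tagger", "-token", par, corpus, output]


if __name__ == "__main__":
    retagged = [("that", "TCOM"), ("fact", "NN"), ("that", "DT")]
    print("\n".join(lexicon_lines(build_lexicon(retagged))))
    print(" ".join(train_command("lexicon.txt", "open_class.txt", "train.txt", "modified.par")))
```

The argument lists can be passed to subprocess.run once TreeTagger is installed and the three input files have been written to disk.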
Welcome to my blog on SLA and linguistics.
The purpose of this blog is to document and share my experience of using NLP tools for the linguistic analysis of various corpora (learner and native).