Create your matrix of features from texts

In the contributions section, Perl script 1 extracts text units and prints them in a matrix-like format, so that the resulting files can be imported as data frames in R, for example. They can also be used with classifiers such as TiMBL.
To use the script as is, the texts must already be tokenized and tagged. The script targets the forms it, this and that. Feel free to modify it for your own use.
Here is what the output looks like in my case:
DIDID TOKENS TAGS TOKENS3BEFORE TAGS3BEFORE TOKENS2BEFORE TAGS2BEFORE TOKENS1BEFORE TAGS1BEFORE TOKENS1AFTER TAGS1AFTER TOKENS2AFTER TAGS2AFTER TOKENS3AFTER TAGS3AFTER CONTEXT DISCOURSE
DID0014-S001.seq it PNR me PRP something NN about IN SYM SYM okay JJ 0 0
DID0014-S001.seq that TCOM also RB the DT fact NN peoples NNS are VBP drinking VBG 0 0
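
As a rough illustration of how rows like these can be built, here is a minimal Python sketch (it is not the Perl script itself). It assumes the text is already available as a list of (token, tag) pairs. The function name context_rows, the NA padding for positions that fall outside the text, and the tab separator are assumptions made for the example. The CONTEXT and DISCOURSE columns are left out, since how they are filled is not described here.

```python
# Hypothetical sketch, not the original Perl script.
TARGETS = {"it", "this", "that"}   # forms to extract
OFFSETS = (-3, -2, -1, 1, 2, 3)    # 3 before (farthest first), then 3 after
PAD = ("NA", "NA")                 # assumed filler when the window leaves the text


def context_rows(doc_id, tagged):
    """Return one row per target form found in `tagged`.

    `tagged` is a list of (token, tag) pairs. Each row holds the document
    ID, the target token and tag, then the token and tag at each offset.
    """
    rows = []
    for i, (token, tag) in enumerate(tagged):
        if token.lower() not in TARGETS:
            continue
        row = [doc_id, token, tag]
        for offset in OFFSETS:
            j = i + offset
            tok, tg = tagged[j] if 0 <= j < len(tagged) else PAD
            row.extend([tok, tg])
        rows.append(row)
    return rows


if __name__ == "__main__":
    sentence = [("also", "RB"), ("the", "DT"), ("fact", "NN"),
                ("that", "TCOM"), ("peoples", "NNS"), ("are", "VBP"),
                ("drinking", "VBG")]
    for row in context_rows("DID0014-S001.seq", sentence):
        print("\t".join(row))
```

Changing TARGETS or OFFSETS is enough to extract other forms or a wider window.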


The first line of the output is the header row. If you use the file with TiMBL, delete that line first.
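
Removing the header row by hand works, but it can also be scripted. The Python sketch below copies a file while skipping its first line. The function name drop_header and the file names in the usage comment are hypothetical, and the sketch does nothing else to adapt the file for TiMBL.

```python
def drop_header(src_path, dst_path):
    """Copy src_path to dst_path, leaving out the first (header) line."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        next(src, None)  # skip the header; no error if the file is empty
        for line in src:
            dst.write(line)


# Example call (hypothetical file names):
# drop_header("features_with_header.txt", "features_for_timbl.txt")
```

The original file is left untouched, so the version with headers stays available for R.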

