IRASubcat

Example

1. Download IRASubcat from the Download page.

2. Go to the folder where you downloaded IRASubcat and uncompress the file. The command depends on the version you downloaded. For the tar version, run
tar -xf IRASubcat.tar
and for the zip version, run
unzip IRASubcat.zip
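
If the tar or unzip commands are not available, the same step can be done from Python. The following is only a sketch: the function name extract_archive is made up for this example, and it detects the archive type from the file content. Only use it on archives you trust, because extracting an archive writes whatever paths it contains.

```python
import tarfile
import zipfile


def extract_archive(archive_path, target_dir="."):
    """Extract a zip or tar archive into target_dir and return its type."""
    archive_path = str(archive_path)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
        return "zip"
    if tarfile.is_tarfile(archive_path):
        # Mode "r:*" lets tarfile detect gzip, bzip2 or xz compression itself.
        with tarfile.open(archive_path, mode="r:*") as archive:
            archive.extractall(target_dir)
        return "tar"
    raise ValueError(archive_path + " is neither a zip nor a tar archive")
```

For example, extract_archive("IRASubcat.zip") unpacks the zip version into the current folder.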

3. Go into the IRASubcat folder with
cd IRASubcat

4. Put your corpus in this folder. The corpus must be encoded in UTF-8.
If your corpus uses another encoding, you can convert it on the command line with iconv:
iconv -f encoding [-t encoding] [inputfile] [-o outputfile]...
--from-code, -f encoding Convert characters from encoding.
--to-code, -t encoding Convert characters to encoding. If not specified the encoding corresponding to the current locale is used.
--output, -o file Specify output file (instead of stdout).
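
The same conversion can be sketched in Python for systems without iconv. The function name convert_to_utf8 and the file names in the usage note are made up for this example, and you need to know the original encoding of your corpus (ISO-8859-1 is used here only as an illustration).

```python
def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file to UTF-8, like `iconv -f source_encoding -t UTF-8`."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

For example, convert_to_utf8("corpus_latin1.xml", "corpus3.xml", "iso-8859-1"). If the XML file starts with a declaration that names the old encoding, that declaration has to be updated by hand, since neither iconv nor this sketch changes it.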
The corpus must be in XML format and can be in any language. The only thing IRASubcat needs is that the verbs are marked in the XML by a characteristic (attribute) with a particular value. IRASubcat can also take a rich corpus as input, with a lot of information about its items. To show this, this example works with three corpora, named
corpus1.xml:
An example of an English corpus will be added here.
corpus2.xml:
An example of a Russian corpus will be added here.
corpus3.xml:
This is a fragment of the Sensem corpus, a corpus of Spanish. We chose 10 sentences with the verb 'asegurar'. As you can see, each sentence has an ID and contains phrases, and both sentences and phrases carry a lot of information.
<corpus>
    <s ID='5453' semor1='Evento' anotado='1-17' verbo='17' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' sujeto_elidido='1' WN_S='00598975'>
        <phr id='1' rs='T-desp' cat='OEstdir' fs='Obj Directo' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
    </s>
    <s ID='5452' semor1='Evento' anotado='30-43' verbo='33' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' sujeto_elidido='1' WN_S='00598975'>
        <phr id='1' cat='SP-OCompl' fs='Circunstancial'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OInf' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5451' semor1='Evento' anotado='9-35' verbo='9' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' sujeto_elidido='1' WN_S='00598975'>
        <phr id='1' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='2' rs='T-desp' cat='OEstdir' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5450' semor1='Evento' anotado='21-36' verbo='22' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' rs='Ag_ori' cat='SN' fs='Sujeto' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OEstdir' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5449' semor1='Evento' anotado='1-55' verbo='6' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' cat='SP' fs='Circunstancial'></phr>
        <phr id='2' rs='Ag_ori' cat='SN' fs='Sujeto' Argumento='1'></phr>
        <phr id='3' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='4' rs='T-desp' cat='OCompl' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5448' semor1='Evento' anotado='24-46' verbo='25' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' rs='Ag_ori' cat='PR-Rel' fs='Sujeto' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OCompl' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5446' semor1='Evento' anotado='1-27' verbo='9' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' rs='Ag_ori' cat='SN' fs='Sujeto' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OCompl' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5447' semor1='Evento' anotado='1-21' verbo='12' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' rs='Ag_ori' cat='SN' fs='Sujeto' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OCompl' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5445' semor1='Evento' anotado='11-28' verbo='12' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' rs='Ag_ori' cat='PR-Rel' fs='Sujeto' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OInf' fs='Obj Directo' Argumento='1'></phr>
    </s>
    <s ID='5444' semor1='Evento' anotado='1-36' verbo='5' lema_verbo='asegurar'
        sentido='asegurar_2' metaf='0' WN_S='00598975'>
        <phr id='1' rs='Ag_ori' cat='SN' fs='Sujeto' Argumento='1'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
        <phr id='3' rs='T-desp' cat='OCompl' fs='Obj Directo' Argumento='1'></phr>
    </s>
</corpus>
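
With a rich corpus like this one, it can help to list which attributes appear at each level before choosing the ones IRASubcat should use. The sketch below is not part of IRASubcat: the function name attributes_by_tag is made up, and SAMPLE is a shortened copy of the first sentence above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copy of the first sentence of corpus3.xml.
SAMPLE = """<corpus>
    <s ID='5453' lema_verbo='asegurar'>
        <phr id='1' cat='OEstdir' fs='Obj Directo'></phr>
        <phr id='2' cat='verbo' lema_verbo='asegurar'></phr>
    </s>
</corpus>"""


def attributes_by_tag(xml_text):
    """Map each element name to the sorted attribute names seen on it."""
    seen = defaultdict(set)
    for element in ET.fromstring(xml_text).iter():
        seen[element.tag].update(element.attrib)
    return {tag: sorted(names) for tag, names in seen.items()}


for tag, names in attributes_by_tag(SAMPLE).items():
    print(tag, names)
```

To inspect a whole file, read it first, for example with open("corpus3.xml", encoding="utf-8").read(), and pass the text to attributes_by_tag.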
5. Identify in the corpus which characteristic and value mark the verbs. Also identify the parent level of the level that holds the characteristics to study (the same level that holds the characteristic marking verbs), and the key for the output dictionary.
case corpus1.xml:
An English corpus example will be added here as soon as possible.
case corpus2.xml:
A Russian corpus example will be added here as soon as possible.
case corpus3.xml:
In this case the characteristic is 'cat' and the value is 'verbo'. The parent level is 's', and the key of the output dictionary is 'lema_verbo'.
<corpus>

    <s ID='5453' semor1='Evento' anotado='1-17' verbo='17' lema_verbo='asegurar' 
        sentido='asegurar_2' metaf='0' sujeto_elidido='1' WN_S='00598975'>

            <phr id='1' rs='T-desp' cat='OEstdir' fs='Obj Directo' Argumento='1'></phr>
            <phr id='2' cat='verbo'  sentido='asegurar_2' lema_verbo='asegurar'></phr>
    </s>
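
To check that the characteristic and value really pick out the verbs, you can try them on a fragment of the corpus. This is a minimal sketch and not IRASubcat's own code: the function name verb_keys is made up, and FRAGMENT is a shortened, closed copy of the excerpt above.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the excerpt above.
FRAGMENT = """<corpus>
    <s ID='5453' lema_verbo='asegurar'>
        <phr id='1' rs='T-desp' cat='OEstdir' fs='Obj Directo'></phr>
        <phr id='2' cat='verbo' sentido='asegurar_2' lema_verbo='asegurar'></phr>
    </s>
</corpus>"""


def verb_keys(xml_text, attribute, value, parent_tag, key_attribute):
    """Collect the key attribute of every child of parent_tag marked as a verb."""
    keys = []
    for parent in ET.fromstring(xml_text).iter(parent_tag):
        for child in parent:
            if child.get(attribute) == value:
                keys.append(child.get(key_attribute))
    return keys


# Characteristic 'cat' with value 'verbo', parent level 's', key 'lema_verbo'.
print(verb_keys(FRAGMENT, "cat", "verbo", "s", "lema_verbo"))
```

If the list is empty or contains None, either the characteristic and value do not match any element, or the key attribute is missing at that level.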
6. Now customize the configuration file for your corpus and for the kind of execution you want.
case corpus1.xml:
The configuration for the English corpus example will be added here as soon as possible.
case corpus2.xml:
The configuration for the Russian corpus example will be added here as soon as possible.
case corpus3.xml:
In the IRASubcat folder there is a file named config.cfg. This is the configuration file, and it shows the default options. Go to the option TARGET TAGS and change 'sint' to 'fs', which is the attribute that holds the syntactic function in this corpus, then save the file. For more information about the configuration file, read here.
TO CONSIDER VERB LIST = NO
DICTIONARY EXISTING = NO
LENGTH OF SIDE OF THE VERB FOR THE PATTERN = ALL
COMPLETE WITH WORD = NO
ORDER OF TAGS = NO
TARGET TAGS = sint
USE LEXICAL ITEMS = NO
INTRODUCE VERBAL MARK = NO
COLAPSE PATTERNS = NO
MAX ITERATION FOR FIND COLAPSE PATTERNS = FALSE
MINIMAL ABSOLUTE VERBAL FREQUENCY = 0
MINIMAL RELATIVE FREQUENCY TO CONSIDER PATTERN = 0
USE LIKELIHOOD RATIO TEST = YES
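
The edit can also be scripted. The sketch below assumes that config.cfg is a plain list of OPTION = VALUE lines exactly as shown above; the function name set_option is made up for this example.

```python
def set_option(config_text, option, value):
    """Return config_text with the line for `option` set to `value`."""
    lines = config_text.splitlines()
    for index, line in enumerate(lines):
        name, separator, _ = line.partition("=")
        if separator and name.strip() == option:
            lines[index] = option + " = " + value
            return "\n".join(lines) + "\n"
    raise KeyError("option not found: " + option)


defaults = "ORDER OF TAGS = NO\nTARGET TAGS = sint\nUSE LEXICAL ITEMS = NO\n"
print(set_option(defaults, "TARGET TAGS", "fs"))
```

To apply it to the real file, read config.cfg into a string, call set_option, and write the result back.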

7. On the command line, inside the IRASubcat folder, run:

python IRASubcat.py corpus3.xml cat=verbo s lema_verbo config.cfg
Here cat=verbo, s and lema_verbo are the parameters you identified in step 5. Note that lema_verbo and cat=verbo must be at the 'phr' level (whose parent level is 's').
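
To make the role of each argument explicit, the command can be assembled in Python. This is a sketch that assumes the argument order shown in the command above; the names build_command and run_irasubcat and their parameter names are made up here and are not part of IRASubcat.

```python
import subprocess


def build_command(corpus, verb_attribute, verb_value, parent_tag,
                  key_attribute, config="config.cfg", python="python"):
    """Build the argument list in the order used by the command in step 7."""
    return [python, "IRASubcat.py", corpus,
            verb_attribute + "=" + verb_value,
            parent_tag, key_attribute, config]


def run_irasubcat(command, folder="."):
    """Run the command inside the IRASubcat folder and fail on a non-zero exit."""
    return subprocess.run(command, cwd=folder, check=True)


print(build_command("corpus3.xml", "cat", "verbo", "s", "lema_verbo"))
```

Calling run_irasubcat(build_command("corpus3.xml", "cat", "verbo", "s", "lema_verbo")) from inside the IRASubcat folder is then equivalent to typing the command by hand.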

8. That's all! Your lexicon is in the IRASubcat folder in the file OutputDictionaryOrd.xml, and the statistics of the execution are in the file info_file. If the 's' level has an 'ID' attribute, you also get the file IdsSentencesOrigenDictionary.xml with the IDs of the sentences that gave rise to the patterns in OutputDictionaryOrd.xml.

Design by IRASystems