Printer Friendly

RECOGNIZING THE VOCABULARY OF BRAZILIAN POPULAR NEWSPAPERS WITH A FREE-ACCESS COMPUTATIONAL DICTIONARY.

Introduction

Written text of popular newspapers in Brazil has still been little investigated in language studies and is somewhat overlooked as a source of data to elaborate language resources for natural language processing (NLP), and even to feed the large corpora that represent current Portuguese. However, this type of source is considered to be a new and important type of vehicle for communication (AMARAL, 2006) and has already been used to generate data for some promising computational applications associated to the description of Brazilian Portuguese (BP) usage, as we see, for example, in Zilio (2015). The various Brazilian popular newspapers have also facilitated studies on new formats of journalistic genres in communication studies (TRISTAO; MUSSE, 2013; SELIGMAN, 2008); among linguists, they have served as a source of teaching material for Portuguese as a native or foreign language (FINATTO; PEREIRA, 2014) or as objects of study in critical discourse analysis (SOARES, 2017).

The vocabulary of this new type of newspaper, however, challenges the coverage of NLP dictionaries. Also known as NLP lexicons, these dictionaries are designed with the particular purpose to be looked up by computer systems, functioning as parts of specific programs (see ZAVAGLIA, 2006). NLP dictionaries, called lexicos computacionais in Portuguese (CHISHMAN, 2016), are those dictionaries built specifically for use in NLP, and for their proper operation they require the careful incorporation of a whole range of linguistic descriptive data (cf. LAPORTE, 2013). Such incorporation involves a process of improvement throughout different versions of these dictionaries. Among different NLP dictionaries, we picked out DELAF PB, freely available for use in different research projects, in linguistics and in computer science, which justifies a critical study of its different versions.

Thus, having in mind the hypothesis that we might uncover gaps in existing dictionaries by using them to process the vocabulary of Brazilian popular newspapers, we carried out the experiment reported in this paper, which consisted in identifying the words (1) in this type of newspaper, and in observing if they are covered or recognised by the two versions of the free-access computational dictionary. The Brazilian popular newspapers that provided the vocabulary were Diario Gaucho ("Daily of Rio Grande do Sul", henceforth DG) and Bahian newspaper Massa! ("Amazing!", MA). The texts we used can be consulted through search words on the website of the Por-Popular Project. (2) More information on these media, their texts and more frequent vocabulary can be found in Amaral (2006) and Finatto (2012). Oliveira (2009) gives references on popular newspapers from other regions of Brazil.

As already mentioned, the language resource selected to test the recognition of the vocabulary of popular newspapers was DELAF PB, an NLP dictionary of BP distributed along with the UNITEX system (PAUMIER, 2016). We used two versions of this dictionary: DELAF PB 2004 (MUNIZ, 2004) and DELAF PB 2015 (VALE; BAPTISTA, 2015; CALCIA et al., 2014). Although this dictionary is still little used by Brazilian linguists, there have been publications about it in Brazil for a long time (MUNIZ, 2004), as well as about its suitability for different research projects (e.g., ALMEIDA; FERREIRA, 2007; VICENTINI, 2010), both in NLP and in applied linguistics, which also justifies its examination in this work. Again, its access is free, and the software provides assistance for adaptation and insertion of entries according to the users' needs.

The DELAF PB dictionaries, thus, are built and updated collaboratively. They have been quite useful in different applications, especially NLP products. An example of a very recent and successful application in Brazil of the DELAF PB dictionaries is the work of Paiva, Barbosa, Faria and Martino (2017). The award-winning research of these authors brought together linguists and computer scientists and produced an automatic translator of written Brazilian Portuguese into the Brazilian Sign Language (LIBRAS). (3)

With the DELAF PB dictionary in its 2004 and 2015 versions, we also wanted to assess its update. Only the more recent version complies with the Portuguese Language Orthographic Agreement. The texts of the popular newspapers that we used were produced before (DG) and after (MA) the adoption of the new spelling. Thus, we could also observe the impact of the presence of two spelling standards on the performance of the resource.

In addition to the above, the general purposes of this work are:

a) to disseminate Brazilian popular newspapers as a source of research on the lexis in written form;

b) to contribute useful data for future extensions and improvements of DELAF PB.

The rest of this paper contains: a) data on the corpus of popular newspapers used in the experiment; b) a presentation of the dictionaries and a description of the operation of recognising or identifying the words in a corpus; c) the stages of the work, with comments and an overview of the results; d) a characterization of the majority profile of the items not covered by the dictionaries, and a discussion of how the performance of the dictionaries might be improved.

Text corpus and samples under analysis

From DG, we used as a starting point only part of the collection available in the PorPopular corpus. In the DG corpus, we selected texts published in 2008 that total 984,465 tokens. We obtained, therefore, a corpus with almost one million words, a significant size in this type of study. In this sample under examination to verify the vocabulary of DG, the texts are of varied types and correspond to what is published in the complete daily newspaper in its printed format. There are general news, horoscope columns, sports columns, police news and miscellaneous topics. Other parts of the DG corpus have already been used in previous studies (ZILIO, 2015; FINATTO et al., 2011) for diverse purposes.

For the corpus of the MA newspaper, we started from a smaller sample, much more specific in terms of text typology: a set of 724 texts, comprised only of news published in the online version of the newspaper and dealing with various themes. (4) These texts add up to 215,776 tokens and were published in 2012, 2014 and 2015.

The DG texts, as already mentioned, date from 2008, before the most recent orthographic change of BP. Information on the quantitative effect of this change on the spelling variation observed in newspapers can be read in Flores and Finatto (2009). For a general view of the impact of this orthographic alteration, see the collection organized by Moreira, Smith and Bocchese (2009).

This reform came in force before the MA texts were produced, but after the DG texts, so we had to deal with two spelling standards in our experiment. Thus, the words in the DG sample of 2008 still have hyphens, accents and tremas (ex: aguentar) which lack in the MA sample (ex: aguentar). These differences led us to use two versions of the DELAF PB dictionary, one for each spelling standard. Moreover, the use of both versions, from 2004 and 2015, also served to verify the comprehensiveness of the update, in terms of the coverage of the items used in these popular newspapers, with and without the new spelling.

Thus, we carried out the verification with both versions of the dictionary: DELAF PB 2004 (MUNIZ, 2004) and DELAF PB 2015 (VALE; BAPTISTA, 2015; CALCIA et al., 2014).

The DELAF dictionaries

'DELAF' is, in fact, a format of NLP dictionaries that originated from research carried out for the French language by the team led by linguist Maurice Gross in the Laboratory of Documentary and Linguistic Automation (LADL). (5) This work was subsequently extended to other languages through the RELEX network of laboratories. (6)

DELAF dictionaries describe the simple words and multiword units of a given language by associating each form with both a lemma and a series of grammatical, semantic and inflectional codes.

The DELAF dictionaries have been developed by teams of linguists for various languages (French, English, Greek, Italian, Spanish, German, Thai, Korean, Polish, Norwegian, Portuguese, Arabic, among others) and are used today in academic and industrial projects. Several, among them the French and the Brazilian Portuguese DELAFs, are freely accessible and updated collaboratively.

Principle of operation of the dictionaries in the UNITEX system

UNITEX is a free-access corpus analyser that processes texts in natural languages with the aid of language resources. These resources are NLP dictionaries in the DELAF format, grammars, and Lexicon-Grammar tables integrated into the system. Some language resources are distributed along with UNITEX and others can be elaborated by users. The UNITEX system is freely accessible and currently used to support studies on different languages (PAJIC et al., 2018), including ancient languages (KINDT, 2018).

As shown by Almeida and Ferreira (2007), the UNITEX system allows for processing any set of texts, so that the linguistic expressions in it are located and categorized. Locating or identifying expressions in a given text or corpus will be effective "provided that such expressions are represented in the dictionary coupled with it" or that the user describes them in a query. UNITEX allows for identifying words by classes.

Once some text has been selected, such as that of our popular newspapers, UNITEX offers to preprocess it. Such preprocessing consists in applying to the text the following operations: normalizing delimiters, segmenting into tokens, normalizing unambiguous forms, segmenting into sentences and, finally, applying NLP dictionaries present in the computer. The presence of these dictionaries in the UNITEX system is a differential with other usual search tools for finding word patterns in corpora, since one can find large classes of words with simple patterns.

When a given corpus or text in a given language is processed, the internal operation of UNITEX consists of constructing a subset of the dictionaries, with only the forms present in the corpus being processed. Thus, for example, the application of the DELAF dictionary on a text such as O time de Neymar corria atras do prejuizo 'Neymar's team was trying hard to make up lost ground' [lit. 'Neymar's team was running after the loss'] will produce the following subset of the dictionary of simple words:
   atras,.ADV
   corria,correr.V:I1s
   corria,correr.V:I3s
   de,.PREP
   do,.PREPXDET+Art+Def:ms
   do,.PREPXPRO+Dem:ms
   o,.DET+Art+Def:ms
   o,.N:ms
   o,.PRO+Dem:ms
   o,ele.PRO+Pes:A3ms
   prejuizo,.N:ms
   time,.N:ms


The name Neymar, being described in a dictionary of proper names distinct from the Portuguese DELAF, will be considered an 'unknown word'.

The application of the dictionaries to the text is performed by the UNITEX system with its program called DICO and generates subsets--'sub dictionaries'--called simple words; compound words; unknown words. In this work, we only deal with the last group.

UNITEX has a second use: it offers support for easily inserting new words and grammatical information into the user's NLP dictionaries, thus adapting them to different purposes. The UNITEX dictionaries deploys the formalism of the DELA (Electronic Dictionaries of the LADL). This formalism allows for describing the simple and multiword lexical entries of a language, optionally assigning them grammatical, semantic and inflectional information. Within this formalism, two kinds of dictionaries are distinguished. The most frequently used kind is the dictionary of inflected forms, in the DELAF format (DELA of inFlected forms). The second kind is the dictionary of lemmas, in the DELAS (DELA of simple words) or DELAC (DELA of compound words) formats, which generates the other dictionaries. In this work, we deal only with DELAF resources. An entry of DELAF PB is organized as follows:

sambou, sambar.V:J3s

Here, sambou 'danced samba' is the form found in the text, 'sambar' the lemma, 'V' the part of speech--in this case a verb--and 'J3s' the inflectional code--in this case a third person singular preterite. The complete list of grammatical and inflectional codes for BP can be found in Muniz (2004).

For BP, the UNITEX system comes with an NLP dictionary in the following two versions:

--the 2005 version: from a DELAS with 61,335 words, a DELAF with 878,095 simple inflected words and 4,100 multiword inflected units was generated. These resources were created from the dictionary of ReGra--the base of the BP spell checker of Word for Windows--for nouns, adjectives and adverbs (MARTINS et al., 1998), and from Vale (1990) 102 verbal inflection templates.

--the 2015 version: the forms with the new spelling resulting from the 1990 Orthographic Agreement were incorporated. 7,900 new entries of simple lemmas (nouns, adjectives and adverbs) were introduced, in addition to verb forms with enclitics and mesoclitics, which were not in the first version. With these changes, the DELAF 2015 of Brazilian Portuguese now totals 10,954,724 entries, describing 7,632,498 unique forms.

Stages of work and results

In general terms, the verification experiment involved:

a) generating the list of words (types) used in the DG newspaper--without changing the spelling;

b) generating the word list of the MA newspaper;

c) comparing each list with the entry lists of the 2004 and 2015 versions of the NLP dictionary DELAF PB;

d) assessing the lexical coverage of each sample--in terms of tokens and types--by DELAF PB, in both versions;

e) proposing ways of including items not identified by the dictionaries.

In these stages, the verification generated two lists of words unknown by the DELAF-PB from each newspaper, for each of the two spelling standards (lists [DG.sub.04], [DG.sub.15], [MA.sub.04] and [MA.sub.15]). Then case sensitivity was removed.

At the beginning of the comparison process, we realized DELAF PB 2015 did not contain the list of abbreviations and acronyms--respectively labelled ABREV and SIGL--which were in the 2004 version of the dictionary. As a matter of fact, during the revision by Calcia et al. (2014), the forms of abbreviations and acronyms (such as ABS or ABNT in the DG newspaper, or UFBA or UFC in MA) were not the object of study in the update of the dictionary and were moved into a separate dictionary. To standardize the experimental conditions, lists [DG.sub.15] and [MA.sub.15] were generated again using DELAF PB 2015 along with the dictionary of abbreviations and acronyms. In what follows, the statistics for DELAF PB 2015 are the result of this second generation.

Tables 1 and 2 below reproduce parts of each of these lists. We display the first 30 items unknown by the UNITEX dictionaries starting with the letter U, and then the first 30 with initial A:
Table 1--Sample of list of items from DG and MA unknown
in DELAF PB, starting with U.

Items      [DG.sub.04]         [DG.sub.15]
starting   DELAF 2004          DELAF 2015
with U

1.         uai                 uai
2.         uau                 uau
3.         ubial               ubial
4.         ubs                 ubs
5.         udesca              udesca
6.         udi                 udi
7.         udiche              udiche
8.         udine               udine
9.         udinese             udinese
10.        uebel               uebel
11.        uefa                uefa
12.        ufa                 ufa
13.        ufcspa              ufcspa
14.        uflacker            uflacker
15.        ufsm                ufsm
16.        ugapoci             ugapoci
17.        ughini              ughini
18.        ugowski             ugowski
19.        uilson              uilson
20.        ulala               ulala
21.        ulbra               ulbra
22.        uli                 uli
23.        ulmen               ulmen
24.        ulsan               ulsan
25.        ultramen            ultramen
26.        ultrasom            ultrasom
27.        ultrassonografias   umbom
28.        umbom               umchorao
29.        umchorao            umespa
30.        umespa              unasul

Items      [MA.sub.04]     [MA.sub.15]
starting   DELAF 2004      DELAF 2015
with U

1.         ualex           ualex
2.         uanderson       uanderson
3.         ubandista       ubandista
4.         ubang           ubang
5.         ubata           ubata
6.         ubirace         ubirace
7.         ucla            ucla
8.         uefs            uefs
9.         uellinton       uellinton
10.        uelliton        uelliton
11.        uenf            uenf
12.        ueslei          ueslei
13.        uezo            uezo
14.        ufc             ufc
15.        uff             uff
16.        ufrrj           ufrrj
17.        uhu             uhu
18.        uibai           uibai
19.        ulicio          ulicio
20.        umidificador    unasul
21.        unasul          under
22.        under           undime
23.        undime          uneb
24.        uneb            unifacs
25.        unifacs         unifcas
26.        unifcas         unirio
27.        unirio          unit
28.        unit            united
29.        united          universitario
30.        universitario   uol

Source: Author's elaboration.

Table 2--Sample of list of items from DG and MA unknown in
DELAF PB, starting with A.

Items             [DG.sub.04]       [DG.sub.15]
starting with A   DELAF 2004        DELAF 2015

1.                aabb              aabb
2.                Aaliyah           aaliyah
3.                aas               aas
4.                abachilov         abachilov
5.                abadia            abadia
6.                abandon           abandon
7.                Abate             abbey
8.                abbey             abbott
9.                abbott            abdel
10.               abdel             abdomem
11.               abdomem           abdul
12.               abdominoplastia   abdulla
13.               Abdul             abebe
14.               abdulla           abech
15.               Abebe             abelao
16.               Abech             abelhocidio
17.               Abelao            abenicio
18.               abelhocidio       ablo
19.               abenicio          about
20.               Ablo              abp
21.               aborda            abracao
22.               aborigenes        abraciclo
23.               About             abramet
24.               Abp               abramovich
25.               Abraca            abrh
26.               abracao           abrhrs
27.               abraciclo         abrigagem
28.               abramet           abrilina
29.               abramovich        abrito
30.               Abrh              abs

Items             [MA.sub.04]     [MA.sub.15]
starting with A   DELAF 2004      DELAF 2015

1.                abada           abada
2.                abadabraco      abadabraco
3.                abadas          abadas
4.                abaralhau       abaralhau
5.                abdelmassih     abdelmassih
6.                abdula          abdula
7.                abefin          abefin
8.                aberbach        aberbach
9.                abisson         abisson
10.               abla            abla
11.               aborda          aboubacar
12.               aboubacar       abravanel
13.               abravanel       academiagf
14.               academiagf      accosta
15.               accosta         acessando
16.               acessando       acessar
17.               acessar         acesse
18.               acesse          acm
19.               acm             adab
20.               adab            adailson
21.               adailson        adailton
22.               adailton        adan
23.               adan            adanascimento
24.               adanascimento   adecir
25.               adecir          adelmario
26.               adelmario       adelmo
27.               adelmo          ademi
28.               ademi           ademilson
29.               ademilson       adenilton
30.               adenilton       aderam

Source: Author's elaboration.


After the generation of lists of unknown words by the DELAF 2004 and 2015 dictionaries, these lists were compared to one another, ignoring case. The items were studied as to their use in texts and considering the information recorded in two conventional dictionaries of BP: Aurelio (FERREIRA, 1999) and Houaiss (HOUAISS, VILLAR, 2009). They were tentatively grouped into categories such as:

(1) Typing errors (umchorao, ubandista);

(2) Old spellings (ideia 'idea');

(3) Proper names (uilson, uanderson);

(4) Abbreviations/acronyms (abs, ufrrj);

(5) Diverse expressions/slang/foreignisms (ulala 'my god!', university, united);

(6) Other nouns (umidificador 'humidifier');

(7) Other (abadabraco, aboubacar).

These categories, of course, were a tentative first approach to the out-of-coverage items in the lists, and they can be refined in future work. Some words can be classified as neologisms or regionalisms, for example. Some are at the same time a noun and a neologism (cf. abelhocidio 'bee killing'). We found virtually no adjectives or verbs among out-of-coverage items, so we did not establish categories 'verb' or 'adjective' in this initial approximation. A multifactorial categorization of out-of-coverage items would mean another work.

Results: overview and summary

We summarize in Table 3 below the main results obtained from the two samples of popular newspapers. These results will be discussed in the next section.

Considerations on the results--Identification of the vocabulary

DELAF PB has a large lexical coverage of twentieth-century newspapers and nineteenth-century literary texts (1.9% of types are out of coverage in the novel Senhora by Jose de Alencar). By comparison, the percentages of out-of-coverage types in Table 3 (from 12% to 19%) are appreciably higher. Therefore, the vocabulary of the kind of newspaper under study can be seen as an important impediment to the identification of words by the DELAF-PB dictionaries. This is relevant for BP researchers interested in its future use.

In order to contextualize the results summarized in the previous section, it is important to remember the factors involved in the identification of items in the vocabulary of popular newspapers by the DELAF PB dictionaries:

a) DG texts have words in the old spelling standard--before the agreement;

b) MA texts have words in the present spelling;

c) only DELAF PB 2015 complies with the new spelling;

d) DELAF PB 2004 does not include the new spelling.

The list of words (types) employed in DG, a set of 53,966 items, includes 19.48% of items unknown by DELAF 2004 and 18.47% of items not covered by DELAF 2015. Thus, there is a small reduction in this percentage of out-of-coverage words between the two versions of the dictionary.

The DG corpus does use an old spelling standard, but the fact that old spellings, such as aguentar, are not present in DELAF 2015 seems to have affected the performance less than the insertion of new words, such as umidificador ('humidifier'), and of verb forms with clitics, such as aborda-lo ('approach him/it'). So, the percentage of out-ofcoverage words has decreased.

In the MA newspaper, we had 22,414 types, of which 13.60% are unknown in DELAF 2004 versus 12.35% in DELAF 2015. In this case, the spelling is entirely in the present standard. Again, text coverage by DELAF improved from 2004 to 2015. This time, the effect of the adaptation of the dictionary to the spelling agreement added to the effect of inserting new words and verbs with clitics.

Thus, both in the MA corpus and in the DG corpus, from DELAF 2004 to DELAF 2015, the performance improved slightly with respect to the recognition of items from the popular newspaper. The recognition of items, from 2004 to 2015, increased, on average, by 1.13 percentage point in terms of recognised types.

In terms of number of tokens, the DG vocabulary is noticeably more covered (96.4% on average) than the MA vocabulary (on average 94.8%), with little effect of the update of the dictionary on this performance. Several kinds of reasons, of course, may explain this. However, it is worth considering that the MA newspaper is from the Northeast region of Brazil, and DG from the South region, which can affect the observed lexical profile. In terms of types, the difference in coverage appears to be the other way around, but this comparison is not significant because of the difference in sample size: in a larger sample, such as that of MA, statistics on the number of types lead to over-representation of infrequent words, which are less likely to appear in the dictionary.

In any case, popular newspapers prove to be an interesting and challenging object of research. However, different current corpora of Brazilian Portuguese are still composed only, as a rule, of data collected from large traditional newspapers, e.g. Folha de S. Paulo (ALUISIO; ALMEIDA, 2006).

As we have seen, for the DELAF PB dictionaries, the vocabulary of popular newspapers seems somewhat "odd". Thus, future work could contrast the percentage of out-of-coverage words between popular and traditional newspapers from the same period.

Another interesting question about out-of-coverage vocabulary is its potential connection with regionalist or local topics (such as the noun cacetinho in DG, the equivalent of pao frances, 'smaller French bread', in Southern Brazil) or with newly created proper names (such as Abadabraco in MA).

In the next section, we outline the most frequent categories of words unknown by the two versions of the dictionaries, based on our samples. Then we briefly discuss how some out-of-coverage items could be incorporated into the DELAF PB 2015 dictionary.

Profile of out-of-coverage words and options for extension of DELAF PB 2015

As we have already noted, the examination of the lists of words in our tables shows that the 2015 version of DELAF PB achieved only a modest gain in coverage for this particular type of newspaper. Thus, the listing of the first 30 words that start with A is practically identical in both columns that refer to the MA corpus. We note the presence of regional vocabulary (abada 'sleeveless shirt', abaralhau 'a Bahian dish with peeled beans and salt cod'--MA) and a set of proper names. Although DELAF PB contains a good number of proper names (Aldemario, Abramovich) that would not normally appear in a conventional dictionary, there is still good work to be done in listing and characterizing proper names, in particular because of their rich variation in Brazil (see Uanderson, Uellinton, Ueslei).

A closer examination of the lists also shows that most of the unrecognized items are nouns (in DG: abrigagem 'sheltering', abdomem 'abdomen', abelhocidio 'bee killing'), including proper names (Abelao). The only two verbs identified in the lists presented here were the verb acessar 'access', with several forms, and the form aborda of the verb abordar 'approach', one of the forms with clitics, which were not covered in the 2004 version of DELAF PB (aborda-lo 'approach him/it').

Abbreviations and acronyms are another issue. Mastering the construction of NLP dictionaries requires a special effort, integrating linguists and computer scientists. The construction of more comprehensive computational resources, taking into consideration recurrent processes and phenomena of the current Portuguese lexicon, is a very complex challenge. In particular, abbreviations, acronyms and proper names are recurrent phenomena in written language and require specific work to preserve and improve the operation of NLP systems. Papers such as Vale et al. (2008) have already pointed out this need in the case of Portuguese historical corpora, also encompassing contemporaneous text collections. As a matter of fact, building specific NLP dictionaries for abbreviations, acronyms and named entities could be an appropriate way to address the challenge we posed to the UNITEX system with our popular newspapers. Of course, as we can see in the following excerpts from two news stories in our corpus, there is much more to be explored:

EXCERPT 1:

A primeira delas e o lancamento do Abadabraco, um bloco que desfilara sem cordas, mas com os folioes. Desfilara sem cordas, mas com os folioes devidamente trajados com abadas. A proposta aqui e incluir. E ter mais pessoas brincando nas ruas e com direito a usar o seu abada. ('The first of them is the launch of Abadabraco, a Carnival group that will march without a rope line, but with its members. (7) It will parade without a rope line, but the revellers will wear the proper shirts. The proposal here is to be inclusive. It's to have more people having fun in the streets and being allowed to use their shirt.')

EXCERPT 2:

Chaleira, Cesar Oliveira & Rogerio Melo, Bochincho, Os Quatro Gauderios, Portal Gaucho e Eco do Minuano & Bonitinho. Foi grande a integracao entre as invernadas adulta e xiru na Sociedade Gaucha de Lomba Grande, em Novo Hamburgo, que comemorou 70 anos na noite de terca-feira. ('Chaleira, Cesar Oliveira & Rogerio Melo, Bochincho, The Four Wanderers, Portal Gaucho and Eco do Minuano & Bonitinho. There was a strong sense of togetherness between the adult and senior dance groups at the Club of Rio Grande do Sul at Lomba Grande, in Novo Hamburgo, which celebrated its 70th birthday on Tuesday night.')

Conclusion

The research problem addressed in this work was to describe and discuss the performance of a large-coverage NLP dictionary, freely available to be used for research on Brazilian Portuguese and in NLP industry.

The DELAF dictionary under examination, although a quite valuable help for different linguistic tasks, might be extended with data from corpora of Brazilian popular newspapers. After all, as already mentioned, this journalistic genre has still been little contemplated as a source of data for the study of cultured written Portuguese. Such lexical incompleteness, in terms of number of out-of-coverage tokens, is noticeably lower in the vocabulary of the newspaper from Rio Grande do Sul (on average, 96.4%) than from Bahia (an average of 94.8%), a fact that motivates further research. Therefore, by examining the performance of different versions of the dictionary in processing the vocabulary of Brazilian popular newspapers, we have demonstrated the validity of taking both types of resources as objects of study and as sources of data for language research conducted in cooperation by linguists and computer scientists.

Acknowledgments

This study was financed in part by the Coordenacao de Aperfeicoamento do Pessoal de Nivel Superior--Brasil (CAPES) in the scope of the CAPES-STIC-AMSud (proj.047/2014), by the the CNPq--National Council for Scientific and Technological Development (PQ Grant--proc. 305625/2016-0 and APV Grant proc. 453058/20159), and Fundacao de Amparo a Pesquisa do Estado de Sao Paulo--FAPESP (proc. 2016/24670-3).

FINATTO, M.; VALE, O.; LAPORTE, E. Reconhecimento do vocabulario de jornais populares brasileiros por um dicionario computacional de acesso livre. Alfa, Sao Paulo, v.63, n.1, p.67-85, 2019.

REFERENCES

ALMEIDA, M. L. L.; FERREIRA, R. G. Ferramentas eletronicas de busca e a pesquisa linguistica: o caso do angulador "um tipo de". Estudos Linguisticos, Sao Paulo, v. 35, n.1, p. 188-196, 2007.

Available in: http://www.gel.hospedagemdesites.ws/estudoslinguisticos/edicoes anteriores/4publica-estudos-2007/sistema06/20.PDF. Access on: 17 Jan. 2018.

ALUISIO, S. M.; ALMEIDA, G. M. B. O que e e como se constroi um corpus? licoes aprendidas na compilacao de varios corpora para a pesquisa linguistica. Calidoscopio, Sao Leopoldo, v.4, n. 3, p. 155-177, set./dez. 2006.

AMARAL, M. F. Jornalismo popular. Sao Paulo: Contexto, 2006.

BIDERMAN, M. T C. Conceito linguistico de palavra. Revista Palavra, Rio de Janeiro, n. 5, p. 81-97,1999.

BIDERMAN, M. T C. Dimensoes da palavra. Filologia e Linguistica Portuguesa, Lisboa, n. 2, p. 81-118, 1998.

Available in: http://dlcv.fflch.usp.br/sites/dlcv.fflch.usp.br/files/Biderman1998_0.pdf. Access on: 17 Jan. 2018.

CALCIA, N. P.; KUCINSKAS, A. B.; MUNIZ, M.; NUNES, M. G. V.; VALE, O. A. Revision et adaptation des dictionnaires et graphes de flexion d'Unitex-PB a la nouvelle orthographe du portugais. 2014. Paper presented at the 3rd. UNITEX/ GRAMLAB WORKSHOP, Tours, 2014.

CHISHMAN, R. Convergencias entre semantica de frames e lexicografia. Linguagem em (Dis)curso--LemD, Tubarao, v. 16, n. 3, p. 547-559, set./dez. 2016.

Available in: http://www.scielo.br/pdf/ld/v16n3/1518-7632-ld-16-03-00547.pdf. Access on: 17 Jan. 2018.

FINATTO, M. J. B. Projeto PorPopular, frequencia de verbos em portugues e no jornal popular popular brasileiro. In: ISQUERDO, A. N.; SEABRA, M. C. T. da C. (org.). As Ciencias do Lexico: lexicologia, lexicografia, terminologia. Campo Grande: Ed. da UFMS, 2012. v.6, p. 227-244.

FINATTO, M. J. B.; PEREIRA, A. Frequencias de verbos em corpora de jornais populares: dados para atividades ensino com os jornais "Diario Gaucho" e o "The Sun". Linguagem Estudos e Pesquisas, Catalao, v.18, n. 2, p. 149-165, 2014. Available in: https://www.revistas.ufg.br/lep/article/view/39730/21165. Access on: 30 Jan. 2018.

FINATTO, M. J. B.; SCARTON, C.; ROCHA, A.; ALUISIO, S. M. Caracteristicas do jornalismo popular: avaliacao da inteligibilidade e auxilio a descricao do genero. In: SIMPOSIO BRASILEIRO DE TECNOLOGIA DA INFORMACAO E DA LINGUAGEM HUMANA, 8., 2011, Cuiaba. Anais do STIL 2011. Cuiaba: Sociedade Brasileira de Computacao, 2011. v. 1, p. 30-39.

FERREIRA, A. B. de H. Dicionario Eletronico Aurelio Seculo XXI. Versao 3.0. Rio de Janeiro: Nova Fronteira: Lexikon Informatica, 1999. 1 CD-ROM.

FLORES V.; FINATTO, M. J. B. Quantificacao e argumento de autoridade no Acordo Ortografico de 2009: aspectos enunciativos e estatisticos. In: MOREIRA, M. E.; SMITH,

M. M.; BOCCHESE, J. da C. (org.). Novo acordo ortografico da lingua portuguesa: questoes para alem da escrita. Porto Alegre: EDIPUCRS, 2009. p. 109-136.

HOUAISS, A.; VILLAR, M. S. Dicionario Houaiss de Lingua Portuguesa. Elaborado pelo Instituto Antonio Houaiss de Lexicografia e Banco de Dados da Lingua Portuguesa. Rio de Janeiro: Objetiva, 2009.

KINDT, B. Processing tools for Greek and other languages of the Christian Middle East. Journal of Data Mining and Digital Humanities, Cedex, 2018. Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages. Available in: https://jdmdh.episciences.org/4184. Access on: 27 Jun. 2018.

LAPORTE, E. Dictionaries for language processing: readability and organization of information. In: LAPORTE, E.; SMARSARO, A.; VALE, O. A. (org.) Dialogar e preciso: linguistica para o processamento de linguas. Vitoria: PPGEL/UFES, 2013. p. 119-132.

MARTINS, R. T.; HASEGAWA, R.; NUNES, M. D. G. V; MONTILHA, G.; OLIVEIRA, O. N. Linguistic issues in the development of REGRA: a grammar checker for Brazilian Portuguese. Natural Language Engineering, Cambridge, v. 4, n. 4, p.287-307, 1998.

MOREIRA, M. E.; SMITH, M. M.; BOCCHESE, J. da C. (org.). Novo acordo ortografico da lingua portuguesa: questoes para alem da escrita. Porto Alegre: EDIPUCRS, 2009.

MUNIZ, M. C. M. A construcao de recursos linguistico-computacionais para o portugues do Brasil: o projeto de Unitex-PB. 2004. 92f. Dissertation (Master in Ciencia da Computacao e Matematica Computacional)--Instituto de Ciencias Matematicas de Sao Carlos, Universidade de Sao Paulo, 2004.

Available in: http://ladl.univ-mlv.fr/brasil/bibliografia/oto/DissMuniz2004.pdf. Access on: 22 Jan. 2018.

OLIVEIRA, M. R. A. R. Jornal Popular X Jornal Tradicional: analise lexico-gramatical da noticia a partir da Linguistica de Corpus: um estudo de casos dos jornais cariocas "O Globo" e "O Dia". Veredas: Revista de Estudos Linguisticos, Juiz de Fora, v. 13, n. 2, p. 07-19, 2009.

PAIVA, F. A.; BARBOSA, P.; FARIA, P.; MARTINO, J. M. de. Towards machine translation from Brazilian Portuguese to libras: a corpus-based, morphosyntactic analysis In: EVERS, A.; NARDES, A.; BRANGEL, L.; FINATTO, M. J. B.; CHISHMAN, R. L. (org.). Caderno de Resumos do ELC-EBRALC 2017. Sao Leopoldo: UNISINOS, 2017. p.24-28.

Available in: http://www.ufrgs.br/elc-ebralc2017/caderno-de-resumos/Cadernode Resumos_ELC2017_16out17.pdf. Access on: 22 Jan. 2018.

PAJIC, V; VUJICIC STANKOVIC, S.; STANKOVIC, R.; PAJIC, M. Semi-automatic extraction of multiword terms from domain-specific corpora. The Electronic Library, Oxford, v.36, n.3, p.550-567, 2018. DOI: https://doi.org/10.1108/EL-06-2017-0128.

PAUMIER, S. Unitex 3.1: user Manual. Paris: Universite Paris-Est Marne-la-Vallee, 2016.

SELIGMAN, L. Jornais populares de qualidade--etica e sensacionalismo em um novo fenomeno no mercado de jornalismo impresso. In: ENCONTRO NACIONAL DE PESQUISADORES EM JORNALISMO--SBPJor, 6., 2008, Sao Bernardo do Campo. SBPJOR--a construcao do campo do jornalismo no Brasil. Sao Bernardo do Campo: Sociedade Brasileira de Pesquisa em Jornalismo, 2008. Available in: https://bjr.sbpjor.org.br/bjr/article/viewFile/199/198. Access on: 23 Jan.2018.

SOARES, L. A. Analise do jornal popular super noticia sob enfoque critico e multimodal. Alfa, Sao Paulo, v. 61, n.3, p. 575-597, 2017.

TRISTAO, M. B.; MUSSE, C. F. O direito a informacao e o (ainda restrito) espaco cidadao no Jornalismo Popular impresso. Intercom: Rev. Bras. Cienc. Comun., Sao Paulo, v. 36, n. 1, p. 39-59, 2013. DOI: http://dx.doi.org/10.1590/S1809-58442013 000100003.

VALE, O.A. Dictionnaire electronique des conjugaisons des verbes du portugais du Bresil. Paris: Universite Paris 7, 1990. (Rapport Technique du LADL, n.27).

VALE, O. A.; BAPTISTA, J. Novo dicionario de formas flexionadas do UnitexPB: avaliacao da flexao verbal. In: STIL 2015; BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY, 10., 2015, NATAL. Proceedings of the Conference. Natal: Pontificia Universidade Catolica do Rio de Janeiro, 2015. v. 1, p. 171-180.

VALE, O. A.; CANDIDO JUNIOR, A.; MUNIZ, M.; BENGTSON, C.; CUCATTO, L.; ALMEIDA, G. M. B.; BATISTA, A.; PARREIRA, M. C.; BIDERMAN, M. T. C.; ALUISIO, S. M. Building a large dictionary of abbreviations for named entity recognition in Portuguese historical corpora. In: INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION; WORKSHOP LANGUAGE TECHNOLOGY FOR CULTURAL HERITAGE DATA, 2008, Marrakech. Proceedings [...]. Marrakech: [s. n.], 2008. p. 47-54.

VICENTINI, M. Construcao de corpora de mensagens eletronicas para conversao automatica em fala. Lingua, literatura e ensino, Campinas, v. 5, p. 229-237, out. 2010. Available in: http://revistas.iel.unicamp.br/index.php/lle/article/view/1151. Access on: 28 Jan. 2018.

ZAVAGLIA, C. Extracao de informacoes de definicoes de um dicionario convencional para a elaboracao de uma base de conhecimento lexica: estrategias e procedimentos linguisticos. In: LONGO, B. N. de O.; DIAS DA SILVA, B. C. (Org.) A construcao de dicionarios e de bases de conhecimento lexical. Araraquara: Laboratorio Editorial FCL/UNESP; Sao Paulo: Cultura Academica Editora, 2006. p. 209-234.

ZILIO, L. Verblexpor: um recurso lexico com anotacao de papeis semanticos para o portugues. 196 f. 2015. Thesis (Doctor in Letras)--Universidade Federal do Rio Grande do Sul, 2015.

Received on 23 March, 2018

Approved on 17 June, 2018

Maria Jose Bocomy FINATTO *

Oto Araujo VALE **

Eric LAPORTE ***

* Federal University of Rio Grande do Sul (UFRGS), Porto Alegre--RS--Brasil. PhD in Letters. maria.finatto@ufrgs. br. ORCID: 0000-0002-6022-8408

** Federal University of Sao Carlos (UFSCar), Center for Education and Human Sciences, Sao Carlos--SP--Brasil. Professor of Department of Letters. otovale@ufscar.br. ORCID: 0000-0002-0091-8079

*** Universite Paris-Est (UPE), LIGM, UPEM/CNRS/ESIEE/ENPC, Champs-sur-Marne--France. Institut d'electronique et d'informatique Gaspard-Monge. eric.laporte@univ-paris-est.fr. ORCID: 0000-0002-0984-0781

(1) The notion of word is quite controversial in the context of language studies, as taught by Biderman (1999, 1998). What a word is can be understood in several ways, hence the terms 'lexeme', 'lemma' or 'lexical unit/item/entry', 'vocable', among others, to denote different facets of this concept.

(2) The texts that we used in this experiment can be consulted via the 'context generator' tool in: http://www.ufrgs.br/ textecc/porlexbras/porpopular/experimente.php

(3) This research and its result, which used the DELAF PB dictionary, were recognised as the best work in the Scientific Merit category, during the Meeting of Corpus Linguistics and Brazilian School of Computational Linguistics 2017 (http://www.ufrgs.br/elc-ebralc2017).

(4) [1] This is 'Sample 2' of newspaper Massa in the PorPopular corpus. It is available for download in-- http://www.ufrgs. br/textecc/porlexbras/porpopular/caixaferramentas.php#dadosCorpus

(5) For more information on LADL, see: http://infolingu.univ-mlv.fr/LADL/Historique.html

(6) Cf. http://unitexgramlab.org/pt/relex-network

(7) In the Bahian Carnival, each marching group of revellers is traditionally surrounded by a rope line, and recognisable by the distinctive sleeveless shirt worn by its members. The name Abadabraco incorporates the words for 'hug' and for this type of shirt (TN).
Table 3--Results obtained from DG and MA.

Newspaper                 Diario Gaucho       MASSA!
                         (DG)--sample of   (MA)--sample
                          diverse texts      of news

Spelling                       old           Present
Types                        53.966           22.414
Tokens                       984.465         215.776
With DELAF 2004:
out-of-coverage types        10.512           3.048
% of types                       19,48%         13,60%
out-of-coverage tokens       36.190           11.624
% of tokens                       3,68%          5,39%
With DELAF 2015:
out-of-coverage types         9.967           2.769
% of types                       18,47%         12,35%
out-of-coverage tokens       34.611           10.870
% of tokens                       3,52%          5,04%

Source: Author's elaboration.
COPYRIGHT 2019 Universidade Estadual Paulista
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2019 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Title Annotation:ORIGINAL ARTICLES
Author:Finatto, Maria Jose Bocomy; Vale, Oto Araujo; Laporte, Eric
Publication:Alfa: Revista de Linguistica
Date:Jan 1, 2019
Words:6260
Previous Article:COGNITION, STEREOTYPE, IMAGINATION AND FANTASY IN THE PROCESS OF APPREHENDING THE NEW REALITY THROUGH THE CASTELLAN LEXICON: THE TESTIMONIES OF...
Next Article:MEDIA AS POLITICAL ACTOR OF THE PUBLIC SPHERE: A TEXTUAL ANALYSIS OF VEJA MAGAZINE ON CORRUPTION CASES.
Topics:

Terms of use | Privacy policy | Copyright © 2019 Farlex, Inc. | Feedback | For webmasters