No. 4, February 2017
Documentos de Trabajo (Working Papers), electronic edition

Improving transparency of the Colombian Peace Treaty with NLP

Francisco Barreras
Mónica Ribero
Felipe Suárez


Serie Documentos de Trabajo Quantil (Quantil Working Papers Series), 2017-4. Electronic edition.

February 2017

Editorial committee:
Francisco Barreras, Junior Researcher
Diego Jara, Co-Director General and Director of Financial Mathematics
Juan David Martin, Junior Researcher
Álvaro J. Riascos, Co-Director General and Director of Economic Models and R&D
Natalia Serna, Junior Researcher

© 2017, Quantil S.A.S., Estudios Económicos, Carrera 7 # 77-07, Oficina 901, Bogotá, D.C., Colombia. Phone: +57 (1) 805 1814. E-mail: [email protected]. http://www.quantil.com.co

Printed in Colombia

The Quantil Working Papers series circulates for discussion and dissemination purposes. The articles have not been peer reviewed nor subjected to any type of formal evaluation by the Quantil team.

Published under license: Attribution – ShareAlike (Creative Commons): https://co.creativecommons.org


Improving transparency of the Colombian Peace Treaty with NLP
A Tool for understanding, navigating and summarizing the Colombian Peace Treaty∗

Francisco Barreras
Quantil, Bogotá
[email protected]

Mónica Ribero
Quantil, Bogotá
[email protected]

Felipe Suárez
Quantil, Bogotá
[email protected]

February 20, 2017

ABSTRACT
Factorization methods and probabilistic models provide useful ways to represent text that can capture properties like sentence relevance, topics in text and even semantic similarity. In general, methods that yield a low-dimensional representation of large volumes of text have become more important and have gained attention in a diversity of fields, since they have the potential of assisting in the understanding of vast amounts of information. The Colombian Peace Treaty documented an exhaustive list of agreements between the FARC guerrilla and the Colombian government across six main sections spanning 297 pages. Surprisingly, the final version of the treaty was made public only 40 days before Colombians had to vote for its approval in a plebiscite. Given that most of the general population probably would not read the document and that the political environment (including the media) was highly polarized, there was a growing need for an unbiased and practical way to review the document before the vote. This paper describes the technical details behind the implementation of a tool that analyses the 2016 Colombian Peace Treaty. By combining Natural Language Processing techniques we were able to provide a web service that helped increase transparency and unbiased reviewing of each section of the peace treaty.

CCS CONCEPTS
• Information Systems → Data Mining

KEYWORDS
GloVe, Latent Dirichlet Allocation, Natural Language Processing, NMF, Peace Treaty.

ACM Reference format:
Francisco Barreras, Mónica Ribero, and Felipe Suárez. 2017. Improving transparency of the Colombian Peace Treaty with NLP. In Proceedings of ACM Woodstock conference, Halifax, Nova Scotia - Canada, August 2017 (KDD'17), 6 pages.

1 Introduction
For more than 5 decades Colombia has witnessed a burdensome and cruel armed conflict with the FARC (Colombian Revolutionary Armed Forces), a communist guerrilla. This conflict has been worsened and prolonged by circumstances like drug dealing, international financial support for the guerrilla and challenging topography. In 2016, the two parties finally reached a successful conclusion to peace negotiations after over 10 failed attempts in the past. The peace treaty needed to be approved by the general population in a plebiscite on October 2nd, 2016, only 40 days after its publication.

Several failed attempts at agreement have resulted in disastrous consequences such as increased violence, forced displacements and the consolidation of drug trafficking [10]. Encouraged by the positive disposition of both sides to participate in a peaceful ending of the war, we propose a novel tool that enables citizens all across the country to extract and summarize the relevant information in the treaty in a free and open web service. We deploy multiple Natural Language Processing methodologies to help end users capture the most relevant information about a specific topic of their interest within the treaty.

We have seen some applications that targeted similar goals for summarization, but mostly for practical applications in business and industry; this project was a chance to apply such techniques for social good and increased transparency. Since the early days of Natural Language Processing, starting with the works of Mani [6], attempts to extract the most general information from large corpora have improved substantially. Not surprisingly, several authors and newspapers released their own –human version– summaries claiming to be politically unbiased and exhaustive. Even if we trust their claims, drawing up a succinct text that encompasses the key information of a large corpus is a very time-consuming task prone to personal biases. It also leads to static texts that may or may not satisfy everyone's personal preferences.

In this paper we use a plain text version of the original agreement1 together with a large corpus consisting of articles from the press, Twitter feeds, books of public access and legislative texts. In Section 2 we detail the theoretical concepts that support the algorithmic implementations that we explain later in Section 3. In Section 4 we discuss the results of our implementations under multiple queries. Our conclusions and possible future improvements are presented in Section 5.

∗ The deployment of the tool is found at the website www.acuerdosdepaz.co
1 The PDF version can be found at http://www.acuerdodepaz.gov.co/acuerdos/acuerdo-final.


2 Theoretical Framework
We use several topic discovery techniques that vectorize words and documents, such as Latent Dirichlet Allocation and GloVe. These methods were used to filter the sentences in the treaty by semantic similarity to a user's query. A factorization method (NMF) was then applied to the resulting subset, from which a relevance score was computed for each sentence and a chart was produced to visualize the proportion of different topics. The relevance score was further used to filter the resulting sentences and produce a summary of varying length, displaying sentences above a relevance threshold in order of appearance. We will now elaborate further on each step of the process.

2.1 Latent Dirichlet Allocation
Consider a collection of documents D = {d_1, d_2, ..., d_n} where each document d is a sequence of N_d words d = (w_{d,1}, ..., w_{d,N_d}) from a vocabulary V = {w_1, ..., w_N}. Latent Dirichlet Allocation is an NLP technique proposed by [2] to reveal latent dimensions of the corpus D, called topics, or in general of any collection of discrete data. In turn, each topic is a distribution over the words in the corpus's vocabulary.

The main idea is that document d is generated by a distribution of topics θ_d ∈ ∆^{k−1}, θ_d ∼ Dirichlet(α), and each topic j = 1, ..., k is a distribution of words φ_j ∈ ∆^{N−1}, assuming the following procedure:

• Choose a topic assignment z_{i,d} ∼ Multinomial(θ_d) for each word i = 1, ..., N_d in document d.
• Choose a word w_i ∼ Multinomial(φ_{z_{i,d}}).

Consequently, topic j is discussed in document d with probability θ_{j,d} and word i appears in topic j with probability φ_{i,j}, for i = 1, ..., N and j = 1, ..., k. These distributions are learned from the data via unsupervised learning; we used the lda package in R, which performs Collapsed Gibbs Sampling [3].

p(T | θ, φ) = ∏_{d∈D} ∏_{j=1}^{N_d} φ_{j, z_{j,d}} θ_{z_{j,d}, d}    (1)

These probability distributions can, however, be seen as vectorizations of words and documents. This will be relevant later, when we present the way we matched user queries to relevant sentences.
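The generative procedure above can be sketched in a few lines (an illustrative toy in Python with NumPy, not the paper's R/lda implementation; all sizes and variable names here are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

k, N = 3, 5                                  # toy sizes: 3 topics, 5-word vocabulary
alpha = np.ones(k)                           # symmetric Dirichlet prior
theta = rng.dirichlet(alpha)                 # θ_d: topic distribution of document d
phi = rng.dirichlet(np.ones(N), size=k)      # φ_j: word distribution of each topic j

def generate_document(theta, phi, n_words, rng):
    """Sample a document: draw a topic z_i per word, then a word from that topic."""
    z = rng.choice(len(theta), size=n_words, p=theta)               # topic assignments
    words = np.array([rng.choice(phi.shape[1], p=phi[t]) for t in z])
    return z, words

z, words = generate_document(theta, phi, n_words=10, rng=rng)
```

Inference (e.g. collapsed Gibbs sampling, as in the lda R package used by the authors) runs this story in reverse, recovering θ and φ from the observed words alone.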

2.2 Semantic word embeddings
GloVe is a word embedding model proposed by [7]. The model combines global matrix factorization and local context window methods to capture semantic and syntactic properties of words. However, the approach differs from LDA since it is based on word co-occurrence windows and not on whole documents.

GloVe is built on probabilities of word co-occurrence. More specifically, let i, j, k be three different words; then the ratio p_{ik}/p_{jk} should be close to one if and only if words i and j co-occur in the same proportion with word k. Otherwise, this ratio will be either close to zero or much bigger than one. Based on these facts, [7] propose a least squares regression model with cost function

J(W) = Σ_{i,j=1}^{V} f(X_{ij}) (w_i^T w̃_j + b_i + b̃_j − log X_{ij})^2    (2)

where
• V is the size of the vocabulary,
• X is the word co-occurrence matrix,
• f is a weighting function that deals with rare or overly frequent co-occurrences,
• w_i, w̃_j ∈ R^d are the word and context word embeddings, respectively,
• b and b̃ are bias terms.
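As a concrete reading of equation (2), the objective can be evaluated directly from a co-occurrence matrix. The Python sketch below is ours, with the weighting f written in the truncated power form from [7] and illustrative default parameters:

```python
import numpy as np

def glove_cost(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Evaluate J: weighted least squares over the nonzero co-occurrences X_ij."""
    J = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i, j] > 0:
                f = min(1.0, (X[i, j] / x_max) ** alpha)   # weighting function f(X_ij)
                err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                J += f * err ** 2
    return J
```

When the embeddings and biases reproduce log X_ij exactly, every residual vanishes and J is zero; training amounts to driving J down by gradient descent on W, W̃, b and b̃.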

2.3 Sentence Ranking
Many summarization techniques are based on a sentence ranking algorithm, followed by choosing the top M sentences, where M is a length parameter chosen by the user. Here, we used a matrix factorization technique proposed by [5] to uncover dimensions of sentences representing semantic features, which are later used to compute a relevance measure to rank sentences. The algorithm is as follows.

1. Construct the term-sentence matrix A ∈ R^{m×n}.
2. Factorize the matrix to uncover k semantic features for each sentence: A = WH, where W ∈ R^{m×k} and H ∈ R^{k×n}.
3. Compute the "Generic Relevance of Sentences" (GRS) for each sentence j:

GRS(j) = Σ_{i=1}^{k} H_{ij} · weight(H_{i∗}),

weight(H_{i∗}) = Σ_{q=1}^{n} H_{iq} / ( Σ_{p=1}^{k} Σ_{q=1}^{n} H_{pq} ).

4. Choose the M sentences with the highest GRS.

Here the GRS measures how strongly relevant topics are discussed in each sentence; k is a parameter chosen by the modeler.
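Steps 1–4 above can be sketched with an off-the-shelf NMF. This is a Python illustration using scikit-learn rather than the authors' R code; the function names and initialization choices are ours:

```python
import numpy as np
from sklearn.decomposition import NMF

def generic_relevance(A, k, random_state=0):
    """GRS(j) for each sentence j, given the term-sentence matrix A (m x n)."""
    model = NMF(n_components=k, init="nndsvda", random_state=random_state,
                max_iter=500)
    W = model.fit_transform(A)            # m x k: term loadings per semantic feature
    H = model.components_                 # k x n: feature strengths per sentence
    weight = H.sum(axis=1) / H.sum()      # weight(H_i*): overall share of feature i
    return H.T @ weight                   # GRS(j) = sum_i H_ij * weight(H_i*)

def top_sentences(A, k, M):
    """Indices of the M sentences with highest GRS, most relevant first."""
    return np.argsort(-generic_relevance(A, k))[:M]
```

The summary is then the top-M sentences re-sorted into their order of appearance in the text, as described in Section 2.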

3 Implementation
In this section we discuss the approach we took in dealing with the two proposed objectives. The peace treaty is divided into six main sections regarding different aspects of the termination of the armed conflict, together with several other sections of appendixes, comprising 297 pages. The most relevant information is thus contained within the six main sections, from which we extract our corpus. We also gathered a large corpus consisting of articles from the press, Twitter feeds, books of free access and legislative texts to extract the vectorial representation for words.

After properly preprocessing the texts in order to remove unwanted punctuation, non-ASCII characters, unwanted numbering, hashtags, links and repeated spaces, we performed the following procedures:

1. Trained word representations using LDA and GloVe.
2. Vectorized the peace treaty by section.


3. Computed pretrained topics.

The first step of the data processing was training the vector representation of words. For LDA, we constructed a Document-Term Matrix (DTM) using individual propositions as documents. Propositions were parsed by splitting paragraphs into separate sentences divided by periods or semicolons. Additionally, we removed stopwords and swept the number of topics, k, over the range {30, 50, 100, 300}.

For the GloVe representation, we constructed the Term Co-occurrence Matrix (TCM), X, using windows of length k in the range {3, 5, 7, 9}. The corpus on which we trained these algorithms was chosen to consist of Spanish texts on politics, war and drug trafficking; the complete corpus consisted of the peace treaty, 12 books, 9,800 paragraphs from Constitutional Court sentences, 2,000 paragraphs of FARC-related news, and 50,000 tweets.

Calibration of these parameters was done empiricallyby evaluating the coherence of the group of similar wordsfor each word in a test set. Our test set consists of thewords: campo, conflicto, derechos, farc, internacional,justicia, lesa, militares, paramilitares, participación, patria,paz, verdad, víctimas. Neighboring words are calculatedusing the cosine distance.
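Retrieving the neighborhood of each test word reduces to a cosine-similarity ranking over the vocabulary. The Python sketch below is ours, with hypothetical toy vectors; the real vectors come from the trained LDA and GloVe models:

```python
import numpy as np

def nearest_words(query, vectors, vocab, top=5):
    """Words most similar to `query` under cosine similarity, excluding itself."""
    q = vectors[vocab.index(query)]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)                     # most similar first
    return [vocab[i] for i in order if vocab[i] != query][:top]
```

Calibration then consists of inspecting, for each parameter setting, whether these neighbor lists are semantically coherent for the fourteen test words.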

The second step of the processing was to vectorize each section of the treaty. This helped us relate each part of the text to a point in R^k and consequently retrieve all related parts of the text by finding its cosine neighborhood. Sentences were vectorized by averaging the vectors corresponding to each word in the sentence:

φ(p) = (1/|p|) Σ_{w∈p} φ(w).    (3)

For each section s ∈ {1, ..., 6} we will call W_s the set of vectorized texts within section s:

W_s = {φ(p) : p ∈ s}.    (4)
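Equation (3) is a plain average of word vectors. A Python sketch (the embeddings dictionary here is a stand-in for the trained LDA or GloVe vectors; out-of-vocabulary words are simply skipped, which is our own choice):

```python
import numpy as np

def vectorize(proposition, embeddings):
    """φ(p): average of the embeddings of the words in proposition p."""
    vecs = [embeddings[w] for w in proposition.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None
```

Applying this to every proposition of a section yields the set W_s of equation (4).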

Following these processing procedures, we included two algorithms to retrieve specific queries based on particular preferences. There are two types of queries –topic and summary– each handled by its own algorithm.

The last data processing procedure was to compute a list of pretrained topics. This enabled us to retrieve the most searched summaries and distributions instantly, without needing to do any vectorization or neighborhood calculation. We will call the predefined topics P and the relevances to section s r_{i,s}:

P = {φ(q_1), ..., φ(q_l)},

r_{i,s} = Summarize(q_i, s, ε, d).

Identification of the semantic relation of an incoming query q with P is done using cosine similarity. If

max_{p∈P} cos(φ(q), p) > η,

then r_{i,s} is returned as the summary of q instead of Summarize(q, s, ε, d).
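The routing logic just described –serve a cached summary when an incoming query is close enough to a predefined topic, otherwise summarize from scratch– can be sketched as follows (Python, our own names; `summarize` stands in for the paper's Summarize(q, s, ε, d)):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def answer_query(q_vec, predefined, cached, eta, summarize):
    """Return a precomputed summary when max cosine similarity exceeds eta."""
    sims = [cos(q_vec, p) for p in predefined]
    best = int(np.argmax(sims))
    if sims[best] > eta:
        return cached[best]          # instant: no vectorization or search needed
    return summarize(q_vec)          # fall back to the full pipeline
```

The threshold η trades cache hit rate against the risk of serving a summary for the wrong topic.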

4 Use examples
Results of sample queries are displayed in detail in this section, as well as illustrations of the two vector embeddings implemented. We discuss the differences of each vectorization and suggest additional improvements that we could leverage in future work. At the end of the section we show the final tool, released as a public website days after the publication of the treaty.

Figure 1: Illustration of the LDA vectorization of words projected onto its three strongest components.

Figure 2: Illustration of the GloVe vectorization of words projected onto its three strongest components.


Empirical assessment of the optimal parameters after sweeping the number of topics and the window size for Latent Dirichlet Allocation gave a vectorization that we display projected onto a three-dimensional affine transformation via Principal Component Analysis (PCA) [4]. PCA helps us visualize the axes that maximize individual variance and minimize cross correlation. We display the words closest to the test set word list in Figure 2. The colors of the points correspond to neighborhoods that may reveal semantic clusters.
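The three-dimensional views in the figures come from projecting the high-dimensional vectors onto their three strongest principal components, which can be reproduced along these lines (a Python sketch using scikit-learn, with random stand-in vectors in place of the trained embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 50))     # stand-in for 100 trained word vectors in R^50
coords = PCA(n_components=3).fit_transform(vectors)   # 3-D coordinates for plotting
```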

Figure 3: Illustration of the vectorized treaty by section projected onto its three strongest components.

Figure 4: Final interface.

For instance, in Figure 3 we illustrate each part of the treaty after the vectorization transformation.

Figure 5: Final interface, wordcloud

Figure 6: Final interface, summary.

Figure 7: Final interface, distribution.


All algorithms were implemented in the R language [8]. Likewise, the final user tool consisted of the website www.acuerdosdepaz.co, built with the Django web framework [9].

5 Acknowledgments
We would like to thank Simón Ramírez, Carlos Cortés and Sebastián Terán for helping us develop the front end of the website. We would like to thank Diego Jara for the idea of using NLP techniques to facilitate the reading of the peace agreements, and for his comments and support that, along with those of Álvaro Riascos, helped to develop the tool as a user-consumable good. Darío Correal pointed out valuable comments during the implementation stage, for which we are thankful. Finally, we would like to thank Quantil for providing infrastructure and financial support.

6 Conclusions
We introduced a methodology that allowed citizens to explore the peace agreements between the Colombian government and the FARC after a sixty-year conflict. It was a complex and unstructured legal text that people had to review in a one-month period before deciding their vote in the plebiscite. Our aim was to ease the reading rather than replace it; to achieve this we developed tools to contextualize people with the principal factors, and to easily and quickly find sections concerning a particular topic and/or section of the documents.

Our methodology used different Natural Language Processing and Data Mining techniques, including GloVe, Latent Dirichlet Allocation and Non-negative Matrix Factorization. It gave users the opportunity to visualize and understand document compositions through topic distributions, and to generate automatic summaries.

The tool could generalize to explore other documents, providing a friendly way to ensure people are informed. By using these tools in an ethical way, one could prevent political interests and the media from biasing and interfering with people's decisions. Here we presented some examples of the results.

Future directions could include improvements in performance by introducing other classification techniques such as neural networks. Other summarization techniques that improve coverage could also be explored; in the aforementioned methodology we used an NMF technique for summaries because of its computational simplicity and efficiency. However, one could compare results with other techniques; for example, differential evolution algorithms used to optimize coverage while minimizing redundancy have proven successful on some corpora [1].

Other directions that one could explore to enlarge the analysis include the possibility of comparing other documents, such as media coverage, against the treaty to test their accuracy, and using Named Entity Recognition techniques to give precise responses to questions concerning amounts or specific actors.

We believe that tools like the one presented are important and should be developed in order to provide people a way to interact directly with documents that they usually consult through other people or media that can have biased opinions and may omit information for personal interests. In other scenarios it would also be helpful to advise people who usually do not have access to legal counsel in different contexts.

References
[1] Rasim M Alguliyev, Ramiz M Aliguliyev, and Nijat R Isazade. An unsupervised approach to generating generic summaries of documents. Applied Soft Computing, 34:236–250, 2015.
[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[3] Jonathan Chang. lda: Collapsed Gibbs Sampling Methods for Topic Models, 2015. R package version 1.4.2.
[4] Ian Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
[5] Ju-Hong Lee, Sun Park, Chan-Min Ahn, and Daeho Kim. Automatic generic document summarization based on non-negative matrix factorization. Information Processing & Management, 45(1):20–34, 2009.
[6] Inderjeet Mani. Automatic Summarization, volume 3. John Benjamins Publishing, 2001.
[7] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–43, 2014.
[8] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.
[9] Django Software Foundation, Lawrence, Kansas. Django [Computer Software], 2016.
[10] Juan Camilo Restrepo and MA Bernal. La cuestión agraria. Bogotá: Colección Penguin, 2014.
