Regex adventures: Modifying word lists

April 14, 2020

Programming • R • Italian

As I wrote elsewhere, I am preparing to take the CILS B2 exam at the end of 2020. One natural component of the preparation is to widen one's vocabulary as far as possible. I do this with the help of $\LaTeX$ documents in which I collect relevant words that I pick up, divided into four categories: nouns, adjectives, verbs and colloquial phrases.

The format I have used so far is as follows:

Italian Word - German Word

I noticed that I would prefer $\rightarrow$ instead of the dash, with the Italian words in bold, i.e.

Italian Word $\rightarrow$ German Word

To my mind, this is clearer and more illustrative. The problem is that I do not want to reformat close to 1000 entries by hand, as this would mean changing

Italian Word -- German Word \\

to

\textbf{Italian Word} $\rightarrow$ German Word \\

for every single word and expression.

I am not familiar with a replacement function in $\LaTeX$ that takes into account the position of the word in every line. Hence, I had to do it with something else. As I know R quite well and work with it frequently, I used it to read in the complete word list, make all the necessary changes to the strings and output the modified word list in the format that I want. With stringr from the tidyverse, the whole process can be written quite neatly as

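library(stringr)  # also provides the %>% pipe

# Read the word list, one entry per line
A = readLines("Test.txt")

# Left-hand side: everything in front of the first hyphen
Pre = A %>% strsplit("\\\\{4}") %>% 
  sapply(., function(x) x[[1]]) %>% 
  str_trim(., side = "both") %>% 
  str_extract(., ".+?(?=-)") %>% 
  str_trim(., side = "both")
Leftside = paste0("\\textbf{", Pre, "}") 

# Right-hand side: delete everything up to and including the last hyphen
Rightside = A %>% strsplit("\\\\{4}") %>% 
  sapply(., function(x) x[[1]]) %>% 
  str_trim(., side = "both") %>% 
  {sub('.*-', '', .)} %>% 
  str_trim(., side = "both")

# Glue both sides together and write them out;
# quote = FALSE keeps write.table() from wrapping each line in double quotes
New.Words = paste(Leftside, "$\\rightarrow$", Rightside, sep = " ")
write.table(New.Words, "New_Words.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)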

So let’s go through all these steps one after another to see what is going on.

First, we read in the data. As toy data, we consider

l'amarezza -- Verbitterung, Bitterkeit \\
il contenuto -- Inhalt \\
l'esordio = l'inizio -- Anfang \\
lo sciocco -- Idiot, Heini \\
la smorfia -- Grimasse, Fratze \\
il bi\d{a}simo -- Tadel, Rüge \\
la cavia -- Versuchskaninchen, Testperson \\
il soprabito -- Mantel \\

and save it in the file Test.txt.

To read in the data, read.delim(..., sep = "\\\\") cannot be used, because sep has to be a single one-byte character and \\ consists of two. Instead, we use plain readLines() and then, to be on the safe side, pass the resulting strings through strsplit().

A = readLines("Test.txt")
Words = sapply(strsplit(A, "\\\\{4}"), function(x) x[[1]]) 
Words = str_trim(Words, side = c("both"))

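Note that the quantifier matters here. In an R string, "\\\\" is the regex \\, i.e. one literal backslash, so "\\\\{4}" only matches a run of four backslashes. A pattern that matched a single backslash would also hit other backslash commands such as \d{a} and split the strings in the wrong place. Since the lines here end in only two backslashes, the split leaves every line intact, which is why the trailing \\ is carried through to the final output.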
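To see what the quantifier does, here is a minimal sketch on a single made-up entry (the string x is an assumption for illustration, not part of the word list). It compares splitting on a run of four backslashes, a run of two, and a single backslash:

```r
# Hypothetical entry; in R source every literal backslash is written as "\\"
x <- "il bi\\d{a}simo -- Tadel \\\\"

# Regex \\{4}: a run of four backslashes. There is none, so nothing is split.
four <- strsplit(x, "\\\\{4}")[[1]]

# Regex \\{2}: splits only at the line-ending \\ and leaves \d{a} alone.
two <- strsplit(x, "\\\\{2}")[[1]]

# Regex \\: splits at every backslash, including the one in \d{a}.
one <- strsplit(x, "\\\\")[[1]]

print(four)
print(two)
print(one)
```

Only the last variant breaks the entry apart in the middle of the Italian word.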

The command

Words = str_trim(Words, side = c("both"))

makes sure that there is no whitespace padding at either end of the string. Next we need to extract the part of the string in front of --. For that we need a regex, specifically the pattern .+?(?=-). The lazy .+? together with the lookahead (?=-) matches everything in front of the first hyphen; applying it with str_extract() and trimming the result once more gives the vector Pre. Note that this gets us into trouble when an Italian entry itself contains a hyphen. In that case the lookahead would have to require both hyphens, i.e. (?=--). For the present purposes, a single - suffices. Next we can put together the command that sets the Italian word in boldface.

Leftside = paste0("\\textbf{", Pre, "}")

Note that we need to escape the backslash so that it is written out correctly later on!
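The difference between how R prints a string and what the string actually contains can be checked directly. The following sketch uses only base R; the variable names are chosen for illustration.

```r
bold <- paste0("\\textbf{", "la cavia", "}")

print(bold)        # print() shows the escaped form with a doubled backslash
writeLines(bold)   # writeLines() shows the actual characters

# The same rule applies to the arrow: "\r" on its own is a carriage return,
# so the backslash in \rightarrow has to be escaped as well.
arrow_wrong <- "$\rightarrow$"
arrow_right <- "$\\rightarrow$"
print(nchar(arrow_wrong))  # one fewer, because "\r" is a single character
print(nchar(arrow_right))
```

The doubled backslash therefore exists only in the R source and in print() output, not in the file that is written at the end.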

The partial output so far is

> Leftside
[1] "\\textbf{l'amarezza}"           "\\textbf{il contenuto}"        
[3] "\\textbf{l'esordio = l'inizio}" "\\textbf{lo sciocco}"          
[5] "\\textbf{la smorfia}"           "\\textbf{il bi\\d{a}simo}"     
[7] "\\textbf{la cavia}"             "\\textbf{il soprabito}"
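The hyphen caveat can be made concrete with a small sketch. The entry below is made up for illustration and is not part of the word list. To keep the example free of package dependencies, it uses base R's regexpr() and regmatches() with perl = TRUE in place of str_extract(), and trimws() in place of str_trim().

```r
# Hypothetical entry whose Italian side contains a hyphen
x <- "l'italo-americano -- Italoamerikaner \\\\"

# Lookahead for a single hyphen: stops at the hyphen inside the word
short <- trimws(regmatches(x, regexpr(".+?(?=-)", x, perl = TRUE)))

# Lookahead for two hyphens: stops at the separator instead
full <- trimws(regmatches(x, regexpr(".+?(?=--)", x, perl = TRUE)))

print(short)
print(full)
```

With the two-hyphen lookahead the complete Italian entry is kept, at the price of assuming that -- never occurs inside an entry.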

Next we need the right-hand side of each entry. This is done via

Rightside = str_trim(sub(".*-", "", Words), side = "both")

with the regex pattern .*-, which, similar to the one above, selects everything up to and including the hyphen; we then delete it with sub(). Because .* is greedy, the match extends to the last hyphen in the line.
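Since .* is greedy, a hyphen on the German side would cut the translation short. One possible way around this, sketched here on a made-up entry and not taken from the original script, is to remove everything up to the first -- with a lazy pattern:

```r
# Hypothetical entry with a hyphen in the German translation
x <- "l'italo-americano -- Italo-Amerikaner \\\\"

# Greedy: deletes up to the last hyphen and loses "Italo-"
greedy <- trimws(sub(".*-", "", x))

# Lazy and anchored: deletes only up to the first "--"
lazy <- trimws(sub("^.*?--", "", x, perl = TRUE))

print(greedy)
print(lazy)
```

For the toy data in this post both variants give the same result, because none of the German translations contains a hyphen.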

Finally, we put it all together to arrive at

New.Words = paste(Leftside, "$\\rightarrow$", Rightside, sep = " ")

and our new words are

> New.Words
[1] "\\textbf{l'amarezza} $\\rightarrow$ Verbitterung, Bitterkeit \\\\"
[2] "\\textbf{il contenuto} $\\rightarrow$ Inhalt \\\\"
[3] "\\textbf{l'esordio = l'inizio} $\\rightarrow$ Anfang \\\\"
[4] "\\textbf{lo sciocco} $\\rightarrow$ Idiot, Heini \\\\"
[5] "\\textbf{la smorfia} $\\rightarrow$ Grimasse, Fratze \\\\"
[6] "\\textbf{il bi\\d{a}simo} $\\rightarrow$ Tadel, Rüge \\\\"
[7] "\\textbf{la cavia} $\\rightarrow$ Versuchskaninchen, Testperson \\\\"
[8] "\\textbf{il soprabito} $\\rightarrow$ Mantel \\\\"
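As an aside, the same transformation can also be sketched as a single substitution with capture groups. This is an alternative formulation under the same assumptions about the input format (one entry per line, Italian and German separated by --), not the approach used above; the toy vector A below stands in for readLines("Test.txt").

```r
# Two toy entries standing in for the contents of Test.txt
A <- c("la cavia -- Versuchskaninchen, Testperson \\\\",
       "il soprabito -- Mantel \\\\ ")

# Group 1: Italian side (lazy, up to the first "--"); group 2: the rest.
# In the replacement, "\\\\" yields one literal backslash.
New.Words <- sub("^(.+?)\\s*--\\s*(.+)$",
                 "\\\\textbf{\\1} $\\\\rightarrow$ \\2",
                 trimws(A), perl = TRUE)

writeLines(New.Words)  # shows the lines as they would appear in the .tex file
```

The step-by-step version is easier to debug, while the single sub() call keeps the left and right side in one pattern.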