Regex adventures: Modifying word lists
As I wrote elsewhere, I am preparing to take the CILS B2 exam at the end of 2020. One natural component of the preparation is to widen your vocabulary as far as possible. I do this with the help of $\LaTeX$ documents in which I collect relevant words that I pick up, divided into four categories: nouns, adjectives, verbs and colloquial phrases.
The format I have usually used is as follows,
I noticed that I would like it better if it had $\rightarrow$ instead of -- and the Italian words in bold, i.e.
To my mind, this is clearer and more illustrative. The problem is that I do not want to convert the formatting of close to 1000 words manually as this would mean changing
Italian Word -- German Word \\ to
\textbf{Italian Word} $\rightarrow$ German Word \\ for up to 1000 words and expressions.
I am not familiar with a replacement function in $\LaTeX$ that takes into account the position of the word in every line. Hence, I had to do it with something else. As I know R quite well and work with it frequently, I used it to read in the complete word list, make all the necessary changes to the strings and output the modified word list in the format that I want. Using the tidyverse, this whole process can be written pretty neatly and clearly as
library(stringr)  # also re-exports the %>% pipe

# Read the word list, one entry per line
A = readLines("Test.txt")

# Italian side: everything in front of the first hyphen
Pre = A %>% strsplit("\\\\{4}") %>%
  sapply(., function(x) x[[1]]) %>%
  str_trim(., side = c("both")) %>%
  str_extract(., ".+?(?=-)") %>%
  str_trim(., side = "both")
Leftside = paste0("\\textbf{", Pre, "}")

# German side: everything after the last hyphen
Rightside = A %>% strsplit("\\\\{4}") %>%
  sapply(., function(x) x[[1]]) %>%
  str_trim(., side = c("both")) %>%
  {sub('.*-', '', .)} %>%
  str_trim(., side = "both")

# Glue both sides together; quote = FALSE keeps R from wrapping each line in quotes
New.Words = paste(Leftside, "$\\rightarrow$", Rightside, sep = " ")
write.table(New.Words, "New_Words.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)
So let's go through all these steps one after another to see what is going on.
First, we read in the data. As toy data, we consider
l'amarezza -- Verbitterung, Bitterkeit \\
il contenuto -- Inhalt \\
l'esordio = l'inizio -- Anfang \\
lo sciocco -- Idiot, Heini \\
la smorfia -- Grimasse, Fratze \\
il bi\d{a}simo -- Tadel, Rüge \\
la cavia -- Versuchskaninchen, Testperson \\
il soprabito -- Mantel \\
and save it in the file Test.txt.
To read in the data, it is not possible to use read.delim(..., sep = "\\\\"), because sep has to be a single one-byte character and \\ consists of two! Instead, we use plain readLines() and then, for safety reasons, split the resulting strings at \\\\ with strsplit().
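A = readLines("Test.txt")
Words = sapply(strsplit(A, "\\\\{4}"), function(x) x[[1]])
Words = str_trim(Words, side = c("both"))
Note that the R string "\\\\{4}" is the regular expression \\{4}, which matches a run of exactly four literal backslashes. A looser pattern could also match other backslash commands and split the strings illegitimately. The toy lines end in only two backslashes, so for them the split leaves each line intact, which is why the trailing \\ is still present in the final output.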
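To see what the split pattern does, here is a minimal sketch in base R. Both strings are made up for illustration: one ends in two backslashes like the toy data, the other contains a run of four.

```r
# Two illustrative strings (hypothetical, not taken from the word list).
two  <- "il contenuto -- Inhalt \\\\"           # ends in two literal backslashes
four <- "il contenuto -- Inhalt \\\\\\\\ rest"  # contains four literal backslashes

# The R string "\\\\{4}" is the regex \\{4}: a run of exactly four backslashes.
split_two  <- strsplit(two,  "\\\\{4}")[[1]]
split_four <- strsplit(four, "\\\\{4}")[[1]]

length(split_two)   # 1: no match, the line comes back unchanged
length(split_four)  # 2: the string is cut at the four backslashes
```

Taking the first element with x[[1]] therefore returns the whole line whenever no run of four backslashes is present.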
The command
Words = str_trim(Words, side = c("both"))
makes sure that there is no whitespace padding on either end of the string. Next we need to extract the part of the string in front of --. To do that we need regex, specifically the pattern .+?(?=-). This lazily matches everything in front of the first hyphen. In code, this step reads
Pre = str_trim(str_extract(Words, ".+?(?=-)"), side = "both")
Note that we get into trouble when an Italian word itself contains a hyphen. In that case we would need to specify the number of hyphens exactly, for instance by putting -- into the lookahead. For our purposes, a single - suffices. Next we can put together the command that sets the Italian word in boldface.
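A minimal sketch of the extraction step, assuming stringr is installed. The second entry is a made-up example with a hyphenated Italian word; it is not part of the toy data and only serves to show where the single-hyphen lookahead breaks.

```r
library(stringr)

# First entry from the toy data, second one hypothetical (hyphenated Italian word).
words <- c("il contenuto -- Inhalt \\\\",
           "l'italo-americano -- Italoamerikaner \\\\")

# Lazy match up to the first single hyphen: fine for the first entry,
# but it cuts the hyphenated word short.
one_hyphen <- str_trim(str_extract(words, ".+?(?=-)"))

# Requiring two hyphens in the lookahead stops at the separator instead.
two_hyphens <- str_trim(str_extract(words, ".+?(?=--)"))

one_hyphen   # "il contenuto" "l'italo"
two_hyphens  # "il contenuto" "l'italo-americano"
```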
Leftside = paste0("\\textbf{", Pre, "}")
Note that we need to escape the backslash so that it is printed correctly later on!
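The difference between the escaped form that R prints and the characters that actually end up in the string can be checked with a short base R sketch, here for a single entry of the toy data:

```r
Pre <- "la cavia"                          # one entry from the toy data
Leftside <- paste0("\\textbf{", Pre, "}")

print(Leftside)       # escaped display form: "\\textbf{la cavia}"
writeLines(Leftside)  # the actual characters: \textbf{la cavia}
nchar(Leftside)       # 17: the backslash counts as a single character
```

So the doubled backslash only exists in the R source and in the console display; the string itself, and hence the file written later, contains a single one.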
The partial output so far is
> Leftside
[1] "\\textbf{l'amarezza}" "\\textbf{il contenuto}"
[3] "\\textbf{l'esordio = l'inizio}" "\\textbf{lo sciocco}"
[5] "\\textbf{la smorfia}" "\\textbf{il bi\\d{a}simo}"
[7] "\\textbf{la cavia}" "\\textbf{il soprabito}"
Next we need to build the right-hand side of each entry. This is done via
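Rightside = str_trim(sub(".*-", "", Words), side = "both")
with the regex pattern .*- which, similar to the one above, selects everything in front of and including the hyphen, which we then delete with sub(). Because .* is greedy, the match runs up to the last hyphen in the line, so this only works as long as the German side contains no hyphen itself.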
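The effect of the greedy match can be sketched in base R with a made-up entry whose German side contains a hyphen (the entry is an assumption for illustration, not part of the toy data). The lazy variant anchored on the two-hyphen separator is one possible alternative, not what the script above uses.

```r
# Hypothetical entry with a hyphen on the German side.
entry <- "la posta elettronica -- E-Mail \\\\"

# Greedy .*- runs to the LAST hyphen, so part of the German word is lost.
greedy <- trimws(sub(".*-", "", entry))

# A lazy match that stops at the first "--" keeps the German side intact.
lazy <- trimws(sub("^.*?--", "", entry, perl = TRUE))

writeLines(greedy)  # Mail \\
writeLines(lazy)    # E-Mail \\
```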
Finally, we put it all together to arrive at
New.Words = paste(Leftside, "$\\rightarrow$", Rightside, sep = " ")
and our new words are
> New.Words
[1] "\\textbf{l'amarezza} $\\rightarrow$ Verbitterung, Bitterkeit \\\\"
[2] "\\textbf{il contenuto} $\\rightarrow$ Inhalt \\\\"
[3] "\\textbf{l'esordio = l'inizio} $\\rightarrow$ Anfang \\\\"
[4] "\\textbf{lo sciocco} $\\rightarrow$ Idiot, Heini \\\\"
[5] "\\textbf{la smorfia} $\\rightarrow$ Grimasse, Fratze \\\\"
[6] "\\textbf{il bi\\d{a}simo} $\\rightarrow$ Tadel, Rüge \\\\"
[7] "\\textbf{la cavia} $\\rightarrow$ Versuchskaninchen, Testperson \\\\"
[8] "\\textbf{il soprabito} $\\rightarrow$ Mantel \\\\"
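For comparison, the same conversion can also be sketched as a single substitution with two capture groups in base R. This is an alternative under the assumption that every line has the form "Italian -- German \\"; it is not the approach used above, and the variable names are chosen for illustration only.

```r
# Two lines from the toy data.
lines <- c("l'amarezza -- Verbitterung, Bitterkeit \\\\",
           "il contenuto -- Inhalt \\\\")

# Group 1: Italian side (lazy, up to the first "--"); group 2: the rest of the line.
# In the replacement, "\\\\" yields one literal backslash and "\\1" inserts group 1.
converted <- sub("^\\s*(.+?)\\s*--\\s*(.*)$",
                 "\\\\textbf{\\1} $\\\\rightarrow$ \\2",
                 lines, perl = TRUE)

writeLines(converted)
```

Since the pattern stops at the first occurrence of two hyphens, a single hyphen on either side of the separator does not disturb the split.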