Regex adventures II: Alphabetically sorting word lists

Last time I wrote a blog post about automatically modifying large word lists with the help of regular expressions. Having replaced symbols and made typographic changes, I now also want to change the order of the words. I usually note down new words as soon as I encounter them, that is, in no alphabetical order. As before, I do not want to sort them manually, since that could be even more work than the typographic changes.

The problem with alphabetical ordering is that a lot of words, especially nouns, start with a definite article, e.g. il, la, lo or l’. When reordering we have to take that into account. For example, as $i$ comes before $l$ in the alphabet, il biasimo would be sorted before l’amarezza, although amarezza should precede biasimo.
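To make the problem concrete, here is a minimal sketch in R. The vector name nouns is made up for illustration, and method = "radix" is used only so that the result does not depend on the locale (it sorts in the C locale).

```r
# Three nouns with their articles (taken from the word list used in this post)
nouns <- c("l'amarezza", "il biasimo", "la cavia")

# Naive sort: the articles decide the order, not the nouns
naive <- sort(nouns, method = "radix")
print(naive)
# [1] "il biasimo" "l'amarezza" "la cavia"
```

The order we actually want is amarezza, biasimo, cavia, so the articles have to be ignored while sorting.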

For a pure list of Italian aggettivi, alphabetical ordering is easy as there is no article.

adj = c("immusonito", "buono", "immobile")
> sort(adj)
[1] "buono"      "immobile"   "immusonito"

Note that when words share their first few characters, R keeps comparing them character by character until they differ, which is why immobile comes before immusonito. The same works for Italian verbs.

For a pure list of Italian nouns, regex comes in. Let Test.txt contain the same words as last time, i.e.

l'amarezza -- Verbitterung, Bitterkeit \\
il contenuto -- Inhalt \\
l'esordio = l'inizio -- Anfang \\
lo sciocco -- Idiot, Heini \\
la smorfia -- Grimasse, Fratze \\
il biasimo -- Tadel, Rüge \\
la cavia -- Versuchskaninchen, Testperson \\
il soprabito -- Mantel \\

Then the code is

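library(magrittr)
# Read the word list, one entry per line
A = readLines("Test.txt")
# Split each line at runs of four backslashes, keep the first piece, trim whitespace
Pre = A %>% strsplit("\\\\{4}") %>%
  sapply(., function(x) x[[1]]) %>%
  stringr::str_trim(., side = "both")
# Delete everything up to and including the first space or apostrophe (the article)
Post = sub(".*?( |\')", "", Pre, perl = TRUE)
# Reorder the full entries by their article-free form
Pre.Sorted = Pre[order(Post)]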

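First, we read in the data and wrangle it so that each element holds one word with its translation. Note that the pattern "\\\\{4}" matches four consecutive backslashes, so lines that end in only two, as in Test.txt above, are not split and keep their trailing backslashes. Then we lazily match (.*?) everything in front of the first space or apostrophe, which is matched by the group ( |\'), and replace the whole match with nothing (""), i.e. delete it. We are left with the nouns we actually want to sort by. Their order is then used to reorder the original entries, which still include the articles.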
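The steps just described can be traced on a small in-memory vector instead of a file. This is only a sketch: the entries are shortened versions of the list above, and method = "radix" is my addition to make the ordering independent of the locale.

```r
# Shortened entries, already split and trimmed
Pre <- c("l'amarezza -- Verbitterung",
         "il contenuto -- Inhalt",
         "lo sciocco -- Idiot",
         "la cavia -- Versuchskaninchen")

# Lazily drop everything up to and including the first space or apostrophe
Post <- sub(".*?( |')", "", Pre, perl = TRUE)
print(Post)
# [1] "amarezza -- Verbitterung"      "contenuto -- Inhalt"
# [3] "sciocco -- Idiot"              "cavia -- Versuchskaninchen"

# order() returns the permutation that sorts Post; apply it to Pre
Pre.Sorted <- Pre[order(Post, method = "radix")]
print(Pre.Sorted)
# [1] "l'amarezza -- Verbitterung"    "la cavia -- Versuchskaninchen"
# [3] "il contenuto -- Inhalt"        "lo sciocco -- Idiot"
```

The point of going through order() rather than sort() is that the articles are only removed from the sort key, never from the entries themselves.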

Finally, when we want to alphabetically order frasi fatte (set phrases), we first have to test whether there is an article at the start of the entry and, only in that case, remove it so that the phrase is ready for ordering. The solution for this task also solves the previous one, and more precisely, because it removes only known articles rather than whatever precedes the first space or apostrophe. So let’s use it for that.

No.Articles = sub("^(la |il |l\'|lo ){1}", "", Pre, perl = TRUE)
> No.Articles
[1] "amarezza -- Verbitterung, Bitterkeit \\\\"   "contenuto -- Inhalt \\\\"                   
[3] "esordio = l'inizio -- Anfang \\\\"           "sciocco -- Idiot, Heini \\\\"               
[5] "smorfia -- Grimasse, Fratze \\\\"            "biasimo -- Tadel, Rüge \\\\"                
[7] "cavia -- Versuchskaninchen, Testperson \\\\" "soprabito -- Mantel \\\\"   

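Here, ^ anchors the match at the start of the string, and the group (la |il |l\'|lo ) lists the alternatives, exactly one of which must occur ({1}). The quantifier {1} is redundant, since a group matches once by default, but it makes the intent explicit.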
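To see why the anchored pattern is the more precise one, here is a small comparison of both regexes. The three phrases are my own illustrative examples and not from the word list; the names phrases, lazy_key and article_key are likewise made up, and method = "radix" is only there to keep the ordering independent of the locale.

```r
# Illustrative frasi fatte; only two of them start with an article
phrases <- c("in bocca al lupo",
             "l'abito non fa il monaco",
             "la goccia che fa traboccare il vaso")

# Lazy pattern from before: cuts off the first word even if it is no article
lazy_key <- sub(".*?( |')", "", phrases, perl = TRUE)
print(lazy_key[1])
# [1] "bocca al lupo"

# Anchored pattern: removes only a leading la, il, l' or lo
article_key <- sub("^(la |il |l'|lo )", "", phrases, perl = TRUE)
print(article_key[1])
# [1] "in bocca al lupo"

# Sort the original phrases by the article-free key
sorted <- phrases[order(article_key, method = "radix")]
print(sorted)
# [1] "l'abito non fa il monaco"
# [2] "la goccia che fa traboccare il vaso"
# [3] "in bocca al lupo"
```

With the lazy pattern, in bocca al lupo would have been filed under b instead of i.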

Hence, with the commands above it is easy to take a list of thousands of words and order it alphabetically while ignoring the articles that would otherwise mess up that ordering.