The day I collected all Milesian inscriptions from the Searchable Greek Inscriptions-tool in a few minutes and then did not do anything with it

January 4, 2023

This is neither a great feat nor particularly hard, but I figured I would note it here, just in case I or someone else can do anything with this tiny guide. Sometime during the holidays I got bored and thought it would be great to make a sort of diagram about the inscriptions of Miletus – all of them!

Now, we do not (currently) have a database with them, and basically all the information is in books and articles and would take a ridiculous amount of time to compile manually. This is boring, so I decided not to. I did want to try to webscrape the inscriptions.packhum.org-site though, mainly to see what happens. It turns out to be very easy, but the resulting data is not what I wanted. Anyway, here is how to do it:

The “Searchable Greek Inscriptions”-Tool

The “Searchable Greek Inscriptions”-Tool is a great site hosted by the Packard Humanities Institute. It contains basic information about, and the text of, a plethora of inscriptions compiled from different publications. Each inscription has an identifier that is prefixed with “PH”.

So, I navigated to one of the “Inschriften von Milet”-books (Milet VI,2) and saw that the identifiers are consecutive from the first one in the book to the last. Each inscription can also be found by its identifier in the URL, e.g. PH351132 (though without the prefix “PH” for the URL) for the first one, and PH351810 for the last one in this book.
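The step from identifier to address can be sketched as a small helper. This is only an illustration: ph_to_url() is a made-up name, not something the site or any package provides, and the address pattern is simply the one observed for the text pages.

```r
# Hypothetical helper: turn PH identifiers into text-page URLs.
# Accepts "PH351132", "351132" or 351132; a leading "PH" is dropped.
ph_to_url <- function(ids) {
  ids <- sub("^PH", "", as.character(ids))
  paste0("https://inscriptions.packhum.org/text/", ids)
}

# First and last inscription of Milet VI,2
ph_to_url(c("PH351132", "PH351810"))
```

Because the identifiers in the book are consecutive, ph_to_url(351132:351810) would give the addresses of all its entries at once.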

The Basic Idea

Now, if we want to get this into R, we need the html-reading capabilities of the rvest-package:

install.packages("rvest")
library(rvest)

We can note one of our identifiers, let’s say PH351132 as the first one in the book, and simply read the complete html of the page at this address into memory. We won’t be able to view it directly, but that doesn’t matter.

id <- 351132
url <-  paste("https://inscriptions.packhum.org/text/", id, sep = "")
page <- read_html(url)
str(page)

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

The only thing that is left for us – basically – is knowing what we want. This tutorial explains what to look for and how to find it when trying to understand html and get the info you need to select the elements from an html-page that you want to scrape. As an example: in the html of each inscription’s page at the Searchable Greek Inscriptions-Tool there is a line that links the title of the corresponding book to the book’s index:

<a href="/book/880?location=1688" class="booklink">Milet VI,2</a>

As you can see, the link has the class="booklink", and its contents are the (short) name of the book, in this case “Milet VI,2”.

We can extract this from the page we just loaded into memory using the html_element()-function to select it, and then the html_text()-function to extract it as a character. (Please note that from here on out, the code relies on the pipe operator %>% from the tidyverse.)

library(tidyverse)
page %>%
  html_element(".booklink") %>% 
  html_text()

## [1] "Milet VI,2"

The same approach works for the second note below the reference on the site, which has the class “ti”:

page %>%
  html_element(".ti") %>% 
  html_text()

## [1] "Ionia — Miletos — 5th c. BC — SbBerlin (1900.1) 111 (mention) — Kadmos 37 (1998) 164-165, n. 3 (cf.)"
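To try out selectors without sending a request, the same calls can be run on a hand-written piece of html. The following is a small sketch under that assumption: the snippet is just the booklink-line from above, wrapped with minimal_html() from rvest, and the variable names are only for illustration.

```r
library(rvest)

# A hand-written stand-in for a text page, so no request is sent
snippet <- minimal_html(
  '<a href="/book/880?location=1688" class="booklink">Milet VI,2</a>'
)

# "." selects by class, just like in a CSS stylesheet
book <- snippet %>%
  html_element(".booklink") %>%
  html_text()

# An attribute (here: the link target) can be read with html_attr()
link <- snippet %>%
  html_element(".booklink") %>%
  html_attr("href")

# A selector that matches nothing gives NA instead of an error
no_match <- snippet %>%
  html_element(".ti") %>%
  html_text()
```

The last case matters for scraping many pages: if a page lacks one of the elements, the corresponding value simply ends up as NA.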

Looping in a Function

To make it easier for me, I wrote a function that saves all the information I can get into a dataframe. Because this takes an annoying amount of time, it occasionally prints how far it has progressed, just to keep track. You feed it only a vector of all the identifiers you want to scrape, and then it does just that.

This is not particularly fast or efficient, so if you plan on getting 163,202 inscriptions, maybe don’t.

collect_packhum <- function(refs = 351132:351134) {
  collection <- as.data.frame(matrix(nrow = length(refs), ncol = 7))
  colnames(collection) <- c("book", "inv", "ph_ref", "note_one", "note_two", "text", "url")
  rownames(collection) <- refs
  
  print("Starting....")
  # loop for every reference number
  for (i in seq_along(refs)) {
    url <-  paste("https://inscriptions.packhum.org/text/", refs[i], sep = "")
    
    # get the whole page (probably inefficient)
    page <- read_html(url)
    
    # Short name of the Book
    collection$book[i] <- page %>%
      html_element(".booklink") %>% 
      html_text()
    
    # Inventory number as referenced after the book
    collection$inv[i] <- page %>%
      html_element(".ref") %>% 
      html_text()
    
    # packhum reference number, though it is technically the same as refs[i]
    ph_ref <- page %>%
      html_element(".docref") %>% 
      html_text() 
    # remove newline
    collection$ph_ref[i] <- gsub("\\n", "", ph_ref)
  
    # the note (first line of entry)
    collection$note_one[i] <- page %>%
      html_element(".note") %>% 
      html_text()
    
    # the note containing the dating (second line of entry)
    collection$note_two[i] <- page %>%
      html_element(".ti") %>% 
      html_text()
    
    # text of the inscription, will probably have formatting issues
    collection$text[i] <- page %>%
      html_element(".text-nowrap") %>% 
      html_text()
    
    # the url, just in case, though we could easily reconstruct it
    collection$url[i] <- url
    
    # this part is just for keeping track
    prog <- pretty(refs, n = ifelse(length(refs) > 100, 100, 10))
    perc <- i / length(refs)
    perc <- round(perc * 100, digits = 1)
    if (refs[i] %in% prog) {
      print(paste("... ", perc, "% done", sep = ""))
    }
  }
  print("Done!")
  return(collection)
}

To collect your inscriptions, then, just run the function with a bunch of identifiers:

refs <- c(351132:351134)
ivm <- collect_packhum(refs)

## [1] "Starting...."
## [1] "Done!"
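If one of the addresses cannot be read, read_html() stops with an error and the whole loop stops with it. A minimal sketch of a more forgiving fetch could look like this. Note that read_page_safely() is a hypothetical helper and not part of rvest, and that the pause of one second is an arbitrary choice of mine, not a documented requirement of the site.

```r
# Hypothetical wrapper: returns NULL instead of stopping when a page fails.
# `reader` defaults to rvest::read_html; `pause` (in seconds) is an arbitrary choice.
read_page_safely <- function(url, reader = rvest::read_html, pause = 1) {
  Sys.sleep(pause)  # wait a moment between requests
  tryCatch(
    reader(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}
```

Inside the loop, one could then write page <- read_page_safely(url) and jump to the next identifier with next whenever page is NULL, which leaves that row filled with NA.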

Now we can save this for later as a csv, and take a look at the resulting dataframe, which now contains basically a table of all that information:

write.csv(ivm, "ivm_table_example.csv")
ivm

	book	inv	ph_ref	note_one	note_two	text	url
351132	Milet VI,2	407	PH351132	Epitaph of Androssos (?) of Halikarnassos. Upper part of a small stele of bluish marble.	Ionia — Miletos — 5th c. BC — SbBerlin (1900.1) 111 (mention) — Kadmos 37 (1998) 164-165, n. 3 (cf.)	1 <U+0391><U+03BD>d<U+03C1><U+03BF>ss<U+0323>- <U+03C9>d<U+03BF><U+03C2> <U+1F09><U+03BB>- <U+03B9><U+03BA>a<U+03C1><U+03BD>a- ss<U+1F73><U+03C9><U+03C2>.	https://inscriptions.packhum.org/text/351132
351133	Milet VI,2	408	PH351133	Epitaph of Herostratos, son of Python. Block of white marble; late archaic script.	Ionia — Miletos — Kalabaktepe — early 5th c. BC	1 <U+1F29><U+03C1><U+03BF>st<U+03C1><U+1F71>t<U+03BF> <U+1F10>µ<U+1F76> s<U+1FC6>µa t<U+03BF><U+0342> <U+03A0><U+1F7B><U+03B8><U+03C9><U+03BD><U+03BF><U+03C2>.	https://inscriptions.packhum.org/text/351133
351134	Milet VI,2	409	PH351134	Epitaph of Leontis. Two joining fragments of a block or plaque of light gray marble, broken on all sides.	Ionia — Miletos — early 5th c. BC	1 <U+039B><U+1F73><U+03BF><U+03BD>t<U+03B9><U+03C2>.	https://inscriptions.packhum.org/text/351134

Obviously, the text did not fare well, but it should not be impossible to reformat it into readability. Sadly (but we could have foreseen this) we do not actually have any information that lends itself to a fast and easy clean-up, but the first step (getting it from the site into your R-memory) is done, at least. If you are so inclined, I guess cleaning the chronological information up in an Excel-Sheet is always a lot easier than copying and pasting everything from the homepage and then STILL cleaning it manually… But with a rather long and involved process you would also be able to automate it, somewhat. See, as an example, this monstrosity of data cleaning, which may or may not have been easier to do by hand.
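As a very rough first step towards such an automation, the dating can be pulled out of the note_two column, where the pieces are separated by long dashes. The sketch below assumes that the date is the first piece containing “BC” or “AD”. That assumption rests only on the three rows shown above and will surely not hold for every entry. extract_date() is a made-up helper name.

```r
# The three note_two values from the table above ("\u2014" is the long dash)
notes <- c(
  "Ionia \u2014 Miletos \u2014 5th c. BC \u2014 SbBerlin (1900.1) 111 (mention) \u2014 Kadmos 37 (1998) 164-165, n. 3 (cf.)",
  "Ionia \u2014 Miletos \u2014 Kalabaktepe \u2014 early 5th c. BC",
  "Ionia \u2014 Miletos \u2014 early 5th c. BC"
)

# Hypothetical helper: split on the dash, keep the first piece that looks like a date
extract_date <- function(note) {
  pieces <- trimws(strsplit(note, "\u2014", fixed = TRUE)[[1]])
  hit <- grep("\\b(BC|AD)\\b", pieces, value = TRUE)
  if (length(hit) == 0) NA_character_ else hit[1]
}

dates <- vapply(notes, extract_date, character(1), USE.NAMES = FALSE)
dates
```

Entries without a recognisable date come back as NA, so they can be found and checked by hand afterwards.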

Conclusion

This was nice, but also didn’t help in any way. Anyhow – if you could use this function, go ahead and have fun.

Lisa Steinmann
