The day I collected all Milesian inscriptions from the Searchable Greek Inscriptions-tool in a few minutes and then did not do anything with it
This is neither a great feat nor particularly hard, but I figured I would note it here, just in case I or someone else can do anything with this tiny guide. Sometime during the holidays I got bored and thought it would be great to make a sort of diagram about the inscriptions of Miletus: all of them!
Now, we do not (currently) have a database of them. Basically all the information is in books and articles and would take a ridiculous amount of time to compile manually. That is boring, so I decided not to do it. I did want to try scraping the inscriptions.packhum.org site, though, mainly to see what happens. It turns out to be really easy, but the resulting data is not what I want. Anyway, here is how to do it:
The “Searchable Greek Inscriptions”-Tool
The “Searchable Greek Inscriptions”-Tool is a great site hosted by the Packard Humanities Institute. It contains basic information about, and the text of, a plethora of inscriptions compiled from different publications. Each inscription has an identifier that is prefixed with “PH”.
So, I navigated to one of the “Inschriften von Milet”-books (Milet VI,2) and saw that the identifiers are consecutive from the first one in the book to the last. Each inscription can also be found by its identifier in the URL, e.g. PH351132 (though without the prefix “PH” for the URL) for the first one, and PH351810 for the last one in this book.
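Because the identifiers in this book are consecutive, the full list of page addresses can be generated instead of collected by hand. A minimal sketch in base R, using the first and last identifiers mentioned above; the variable names are only illustrative:

```r
# First and last identifier of Milet VI,2, without the "PH" prefix.
first_id <- 351132L
last_id  <- 351810L

# Every identifier in between belongs to the same book.
ids <- first_id:last_id

# The page of each inscription sits under /text/<identifier>.
urls <- paste0("https://inscriptions.packhum.org/text/", ids)

length(ids)
head(urls, 3)
```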
The Basic Idea
Now, if we want to get this into R, we need the HTML-reading capabilities of the rvest package.
We can note one of our identifiers, let’s say PH351132 as the first one in the book, and simply read the complete HTML of the page at this address into memory. We won’t be able to view it, but it doesn’t matter.
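Loading the package and reading a single page could look roughly like this. The helper names inscription_url() and read_inscription() are made up for this sketch; read_html() is the parsing function that rvest makes available. The actual download is left commented out, so nothing touches the network until you uncomment it.

```r
library(rvest)

# Build the page address for one identifier (the "PH" prefix is dropped).
inscription_url <- function(id) {
  paste0("https://inscriptions.packhum.org/text/", as.integer(id))
}

# Download and parse the page of one inscription (needs a network connection).
read_inscription <- function(id) {
  read_html(inscription_url(id))
}

inscription_url(351132)

# page <- read_inscription(351132)
```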
The only thing that is left for us, basically, is knowing what we want. This tutorial explains what to look for and how to find it when trying to understand HTML and select the elements of a page that you want to scrape. As an example: in the HTML of each inscription’s page at the Searchable Greek Inscriptions-Tool, there is a line that links the title of the corresponding book to the book’s index:
As you can see, the link has the class="booklink", and its contents are the (short) name of the book, in this case “Milet VI,2”.
We can extract this from the page we just loaded into memory using the html_element() function to select it, and then the html_text() function to extract it as a character. (Please note that from here on out, the code needs to be able to use tidyverse grammar.)
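To try the selection without downloading anything, the sketch below builds a tiny stand-in page with minimal_html(). Only the class and the link text come from the description above; the href is a placeholder. On a real page, the object returned by read_html() would take the place of page.

```r
library(rvest)
library(magrittr)  # provides the %>% pipe

# Stand-in for a downloaded page; the href is a placeholder.
page <- minimal_html('<a class="booklink" href="#">Milet VI,2</a>')

# Select the first <a> element with class "booklink",
# then pull out its text as a character.
book <- page %>%
  html_element("a.booklink") %>%
  html_text()

book
```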
The same approach works for the second note below the reference on the site:
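For the notes, only the selector changes. The class name note in this sketch is purely hypothetical (the real markup has to be looked up with the browser's inspector), and the stand-in page reuses the two notes of PH351134. html_elements() returns every match, so the second note is simply the second element.

```r
library(rvest)
library(magrittr)  # provides the %>% pipe

# Stand-in page; the "note" class is an assumption, not the site's real markup.
# "\u2014" is the long dash that separates the parts of the second note.
page <- minimal_html('
  <a class="booklink" href="#">Milet VI,2</a>
  <p class="note">Epitaph of Leontis. Two joining fragments of a block or plaque of light gray marble, broken on all sides.</p>
  <p class="note">Ionia \u2014 Miletos \u2014 early 5th c. BC</p>
')

# All matching elements, as a character vector with one entry per note.
notes <- page %>%
  html_elements("p.note") %>%
  html_text(trim = TRUE)

note_two <- notes[2]
note_two
```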
Looping in a Function
To make it easier for me, I wrote a function that saves all the information I can get into a dataframe. Because this takes an annoying amount of time, it occasionally echoes how far it has progressed, just to keep track. You feed it only a vector of all the identifiers you want to scrape, and then it does just that.
This is not particularly fast or efficient, so if you plan on getting 163,202 inscriptions, maybe don’t.
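A function along these lines could be sketched as follows. This is not the original function, only a reconstruction under assumptions: the name scrape_inscriptions(), its arguments, and every selector except a.booklink are hypothetical and need to be checked against a real page. The read_page argument exists so that the loop can be tried on a stand-in page before it is pointed at the site.

```r
library(rvest)

# Hypothetical CSS selectors: only "a.booklink" is confirmed by the text above.
# The others are placeholders to be replaced after inspecting a real page.
default_selectors <- c(
  book     = "a.booklink",
  inv      = ".inv",
  note_one = ".note-one",
  note_two = ".note-two",
  text     = ".text"
)

scrape_inscriptions <- function(ids,
                                selectors = default_selectors,
                                read_page = read_html,
                                progress_every = 50) {
  base_url <- "https://inscriptions.packhum.org/text/"
  rows <- vector("list", length(ids))

  for (i in seq_along(ids)) {
    url  <- paste0(base_url, ids[i])
    page <- read_page(url)

    # One string per selector; NA where a selector matches nothing.
    fields <- vapply(
      selectors,
      function(css) html_text(html_element(page, css), trim = TRUE),
      character(1)
    )

    rows[[i]] <- data.frame(
      book     = fields[["book"]],
      inv      = fields[["inv"]],
      ph_ref   = paste0("PH", ids[i]),
      note_one = fields[["note_one"]],
      note_two = fields[["note_two"]],
      text     = fields[["text"]],
      url      = url,
      stringsAsFactors = FALSE
    )

    # Echo progress now and then, since the loop is slow.
    if (i %% progress_every == 0) {
      message(i, " of ", length(ids), " pages done")
    }
  }

  # Stack the one-row data frames and use the identifiers as row names.
  result <- do.call(rbind, rows)
  rownames(result) <- ids
  result
}

# Scraping the whole book (needs a network connection and some patience):
# inscriptions <- scrape_inscriptions(351132:351810)
```

Growing a list and binding it once at the end keeps the loop simple; fields whose selector finds nothing end up as NA instead of stopping the run.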
To collect your inscriptions, then, just run the function with a bunch of identifiers:
Now we can save this for later as a csv, and take a look at the resulting dataframe, which now contains basically a table of all that information:
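Saving and reloading can be sketched with base R alone. The small data frame below stands in for the scraped result and holds only a few values from the first two inscriptions; the file name and the temporary directory are arbitrary choices. Writing the file explicitly as UTF-8 may help the Greek text survive the round trip, but that is an assumption that depends on the system.

```r
# Stand-in for the scraped result, with values from the first two inscriptions.
inscriptions <- data.frame(
  book   = c("Milet VI,2", "Milet VI,2"),
  inv    = c("407", "408"),
  ph_ref = c("PH351132", "PH351133"),
  row.names = c("351132", "351133"),
  stringsAsFactors = FALSE
)

# Write the table to a csv file (here in the session's temporary directory).
csv_path <- file.path(tempdir(), "milet_inscriptions.csv")
write.csv(inscriptions, csv_path, fileEncoding = "UTF-8")

# Read it back, keeping the identifiers as row names and all columns as text.
restored <- read.csv(
  csv_path,
  row.names = 1,
  fileEncoding = "UTF-8",
  colClasses = "character"
)

head(restored)
```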
| | book | inv | ph_ref | note_one | note_two | text | url |
---|---|---|---|---|---|---|---|
351132 | Milet VI,2 | 407 | PH351132 | Epitaph of Androssos (?) of Halikarnassos. Upper part of a small stele of bluish marble. | Ionia — Miletos — 5th c. BC — SbBerlin (1900.1) 111 (mention) — Kadmos 37 (1998) 164-165, n. 3 (cf.) | 1 <U+0391><U+03BD>d<U+03C1><U+03BF>ss<U+0323>- <U+03C9>d<U+03BF><U+03C2> <U+1F09><U+03BB>- <U+03B9><U+03BA>a<U+03C1><U+03BD>a- ss<U+1F73><U+03C9><U+03C2>. | https://inscriptions.packhum.org/text/351132 |
351133 | Milet VI,2 | 408 | PH351133 | Epitaph of Herostratos, son of Python. Block of white marble; late archaic script. | Ionia — Miletos — Kalabaktepe — early 5th c. BC | 1 <U+1F29><U+03C1><U+03BF>st<U+03C1><U+1F71>t<U+03BF> <U+1F10>µ<U+1F76> s<U+1FC6>µa t<U+03BF><U+0342> <U+03A0><U+1F7B><U+03B8><U+03C9><U+03BD><U+03BF><U+03C2>. | https://inscriptions.packhum.org/text/351133 |
351134 | Milet VI,2 | 409 | PH351134 | Epitaph of Leontis. Two joining fragments of a block or plaque of light gray marble, broken on all sides. | Ionia — Miletos — early 5th c. BC | 1 <U+039B><U+1F73><U+03BF><U+03BD>t<U+03B9><U+03C2>. | https://inscriptions.packhum.org/text/351134 |
Obviously, the text did not fare well, but it should not be impossible to reformat it into something readable. Sadly (but we could have foreseen this), we do not actually have any information that lends itself to a fast and easy clean-up, but the first step (getting it from the site into R’s memory) is done, at least. If you are so inclined, I guess cleaning up the chronological information in an Excel sheet is still a lot easier than copying and pasting everything from the homepage and then STILL cleaning it manually. With a rather long and involved process you would also be able to automate it, somewhat. See, as an example, this monstrosity of data cleaning, which may or may not have been easier to do by hand.
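As a first, very small step towards automating the chronological clean-up, the date could be pulled out of the second note. The sketch below uses the three notes from the table above and assumes that the parts are separated by a long dash and that the date is the part containing "c. BC" or "c. AD". The function name extract_date() is made up, and notes with other date formats would need more rules.

```r
# Second notes of the three inscriptions in the table above
# ("\u2014" is the long dash that separates the parts).
note_two <- c(
  "Ionia \u2014 Miletos \u2014 5th c. BC \u2014 SbBerlin (1900.1) 111 (mention) \u2014 Kadmos 37 (1998) 164-165, n. 3 (cf.)",
  "Ionia \u2014 Miletos \u2014 Kalabaktepe \u2014 early 5th c. BC",
  "Ionia \u2014 Miletos \u2014 early 5th c. BC"
)

# Split one note at the dashes and keep the first part that looks like a date.
extract_date <- function(note) {
  parts <- strsplit(note, " \u2014 ", fixed = TRUE)[[1]]
  dated <- parts[grepl("c\\. (BC|AD)", parts)]
  if (length(dated) == 0) NA_character_ else dated[1]
}

dates <- vapply(note_two, extract_date, character(1), USE.NAMES = FALSE)
dates
```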
Conclusion
This was nice, but also didn’t help in any way. Anyhow, if you can use this function, go ahead and have fun.