I recently discovered Early Modern Print (‘discovered’ is perhaps the wrong word: I noticed Brodie Waddell and David Hitchcock talking about it on Facebook). This website provides easy-to-use programmes, including the EEBO Spelling Browser, for analysing text available from Early English Books Online and the related Text Creation Partnership. The Spelling Browser, as designer Anupam Basu explains in this post, measures the frequency of ‘n-grams’, which are ‘contiguous sequences of tokens’, i.e. words or letters, and can be set to different levels of complexity to search for phrases as well as individual words. Naturally, the first thing I did was drop in a search for the terms ‘seaman’, ‘seamen’, ‘mariner’ and ‘mariners’. The results surprised me.

Mariners graph

EEBO Spelling Browser search for ‘mariner,mariners,seaman,seamen’, 10-year moving average; original here

In my Ph.D. thesis, I claimed – because I thought so at the time – that ‘mariner’ was a more technical term, appearing in legal documents like court records and wills alongside its Latin counterpart nauta, but that in general use ‘seaman’ was the more common word, summing up the key ideas of masculinity and separation from society ashore which characterised this stereotype. This was based, I admit, on a fairly impressionistic reading of the available evidence, especially printed sources like pamphlets, newsbooks, ballads, and government publications. I toyed with the idea of statistical analysis, perhaps using the ballads in the English Broadside Ballad Archive (and I wonder if integrating the text from that archive, or others like the Burney Newspapers Collection, would change these results), but I did not have the time to pursue what was, for that project, a pretty minor point. It seems that I was wrong. The graph shows that, for most of the sixteenth and seventeenth centuries, ‘mariner’ and ‘mariners’ were more commonly printed words – only towards the turn of the eighteenth century do ‘seaman’ and ‘seamen’ overtake them.

There are, as Basu explains in his post (and also in this one), some inherent difficulties in these graphs which mean the results cannot be taken as straightforward. They do not show the ‘hits’ for any given word, because the increase in printed material across the period would result in considerably higher numbers for later years that might dwarf and conceal the significance of earlier appearances. Instead, they show each word as a percentage of total text – none of these words rises above 0.00006 per cent of the printed words in any given year. A few appearances of a word in a year with a low number of publications would create a potentially unrepresentative spike. On this graph, the period roughly 1580-1600 shows a higher percentage for ‘mariners’ than after 1640, but the total number of texts before 1600 was much smaller than in the later period, as shown in this graph, also by Basu. Basu therefore emphasizes that it is always important to check the text behind these visualizations.
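To make the arithmetic concrete, here is a minimal sketch in Python of the two steps described above: turning raw hits into a percentage of all the words printed in a year, and then smoothing with a moving average. It is not the Spelling Browser's own code. The function names are my own, the counts are invented for illustration, and I have assumed a trailing window over the years that have data, which may not be how the tool defines its 10-year average.

```python
def relative_frequency(hits_by_year, total_tokens_by_year):
    """Return each year's hits as a percentage of all tokens printed that year."""
    return {
        year: 100.0 * hits_by_year.get(year, 0) / total
        for year, total in sorted(total_tokens_by_year.items())
        if total > 0  # skip years with no text at all
    }


def moving_average(series, window=10):
    """Trailing average over up to `window` data points ending at each year.

    Assumption: the window counts years present in `series`, not calendar
    years, and the first few years are averaged over fewer points.
    """
    years = sorted(series)
    smoothed = {}
    for i, year in enumerate(years):
        recent = years[max(0, i - window + 1): i + 1]
        smoothed[year] = sum(series[y] for y in recent) / len(recent)
    return smoothed


# Invented counts: a hundred times more hits in 1690, but the same share of text.
hits = {1590: 3, 1591: 0, 1690: 300}
totals = {1590: 1_000_000, 1591: 2_000_000, 1690: 100_000_000}

shares = relative_frequency(hits, totals)
smoothed = moving_average({1: 1.0, 2: 3.0, 3: 5.0}, window=2)
```

In the invented figures, 1590 and 1690 come out with the same percentage even though the raw hits differ a hundredfold, which is the reason for normalising. The same figures also show the weakness: three hits are enough to give 1590 that share, so a handful of appearances in a thin year can produce a spike.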

There are deeper problems with the collections behind them, too. Variable early modern spellings, preserved in EEBO and the TCP, can make word-searches pretty complex. This is assuming that the text is accurate: although TCP  – as it says on their website, ‘transcribed by hand’ – is fairly reliable, the text-recognition in other collections could raise problems if this sort of technique were to be applied elsewhere (I have experienced problems text-searching in the Burney Newspapers, for example). More importantly, EEBO is not a complete collection of printed material from this period: that graph linked in the last paragraph shows numbers both for EEBO and the English Short Title Catalogue, demonstrating just how small a proportion of early modern English texts is available on EEBO. Even if all of the contents of the ESTC (or any other catalogue) were uploaded and transcribed, there would still be the question of published works that have not survived in any library or private collection, the number of which we can only guess at. These caveats make the conclusions drawn from such graphs rather tentative – but in the absence of more complete evidence, they are probably the best we can achieve.
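The spelling problem, at least, is easy to picture in code. The sketch below, again in Python, adds up yearly counts for several spellings that a researcher has decided to treat as one word. The variant list and the counts are invented for illustration and are not taken from EEBO or from the Spelling Browser; choosing which spellings belong together is of course a matter of judgement, not something the code can settle.

```python
def combined_counts(counts_by_spelling, variants):
    """Sum per-year counts across all spellings treated as the same word."""
    combined = {}
    for spelling in variants:
        # A spelling with no recorded hits simply contributes nothing.
        for year, n in counts_by_spelling.get(spelling, {}).items():
            combined[year] = combined.get(year, 0) + n
    return combined


# Hypothetical counts and a hypothetical list of variant spellings.
counts = {
    "mariner": {1600: 4, 1601: 2},
    "marriner": {1600: 1},
    "maryner": {1601: 3},
    "seaman": {1600: 2},
}
mariner_total = combined_counts(counts, ["mariner", "marriner", "maryner"])
```

Searching for ‘mariner’ alone would here miss a fifth of the 1600 hits and more than half of those for 1601, so the choice of variants can change the shape of a graph.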

Which brings me, finally, to the problem of interpretation. What does this visualization mean, if it means anything at all? This graph might show quite a minor point, but then it depends on how much meaning you read into the use of different words. It certainly means I need to rethink my ideas – I have been using ‘seaman’ as a shorthand for the professional stereotype of seafarers, on the assumption that this was how contemporaries would have described them, but now it seems that was not the case. Explaining the trends revealed in the graph will require working between the statistics and the original documents. What sort of texts do these words appear in? Pamphlets, newsbooks, official proclamations? Who is using these words, and why? I would guess that the 1580-1600 rise is due to war between England and Spain, and that the shift from ‘mariner/mariners’ to ‘seaman/seamen’ at the end of the period is due to the increasingly public debates at that time about recruitment into the growing Royal Navy. When you plot ‘mariners’, ‘seamen’ and ‘navy’, the percentage for ‘navy’ (the green line on the graph) rises after 1620, and ‘seamen’ begins to rise a little later, with both continuing to increase from about 1680 onwards. What exactly the relationship is between these two linguistic changes, and whether they have any deeper significance, is something I will have to think about.

EEBO Spelling Browser search for ‘mariners,seamen,navy’, 10-year moving average; original here

  1. That’s amazing. Thank you for sharing. Started throwing queries through it and finding some really interesting results. I wonder too how much the methodology can be trusted – or how much it can really tell us – but it’s definitely a useful tool for either creating attractive demonstrations of general broad trends or for opening up new avenues for exploration.

    I want to use one of these alongside my ‘printing east indies’ network that I showed last week…

  2. I would love to see the graphs that you produce… I see it as exactly that – a useful way to test ideas and perhaps prompt new ones, and as long as the caveats are understood, I think it makes a valid research technique (what methodology doesn’t have caveats?). I wonder if we will start to see more of these searches in research publications as well as online? It’s sort of what Phil Withington did in Society in Early Modern England, using the ESTC and clearly defined search terms. Although I sometimes feel that with statistics it is easier to disprove than to prove – because of making that final leap to interpretation.

  3. It’s an amazing little tool isn’t it? As you say, in the post: “The results surprised me.” I think that nicely encapsulates the value of the exercise. So many of both our minor assumptions and major conclusions are based on an impressionistic sense of how people at the time wrote (and talked, and thought) about things. This tool shows us immediately that we need to be very careful with those assumptions.
    It’s also useful because eventually, as all of the remaining EEBO texts are fully transcribed, it will allow us to look at the print corpus as a whole rather than relying on our reading, which will tend to be focused on the most well-known or canonical texts.
    Still, for all the reasons you give, it can only take us so far. I wouldn’t really trust the 16th century results because the number of early texts is so small. It also privileges long works (because they will have more ‘tokens’, i.e. words, per text). Moreover, even once all the EEBO texts are added, it will still miss manuscripts, which will mean that less ‘public’ discussions (e.g. seditious, domestic, etc.) will be missed.
    People are already pushing further with these sorts of analyses. You mentioned Phil’s work, but see also this new article: Alexis D. Litvine (2014). ‘The Industrious Revolution, The Industriousness Discourse, And The Development Of Modern Economies’, Historical Journal, 57, pp 531-570. I don’t think I actually buy this particular take, but it’s certainly interesting to see it moving into mainstream journals.
    (Also, as you know, some of this came up in our ‘History from Below’ discussions last summer, e.g.

  4. Really interesting post Richard, thanks. I think this methodology is certainly thought-provoking, and can raise some useful questions, but I’m still to be convinced that we can draw many conclusions from it.
    These are, of course, tiny percentages, so using them as evidence of ‘common usage’ of terms is problematic. And although it has the potential to tell us about changes in print culture, that again is not the same as saying there is a change in common usage or the way contemporaries described something: it may be that printed forms favoured mariner for much of the century, but that does not mean seaman was not more commonly used in, for instance, the language of self-description. So I wouldn’t say this necessarily shows you were wrong in your ‘impressionistic’ reading – your reading of the sources was done with a greater sensitivity to the context of use than the results of this graph have.
    That said, I’m certainly going to have a play around with this myself and see what leads it throws up. I’m also starting to think that a workshop on ‘counting words’ might be a good way to finesse the way we use this methodology – it is certainly a growing practice and an interesting one…

  5. Thanks Brodie, and thanks Mark! Great words of caution and reassurance at the same time. There is certainly a danger – especially when looking at a nice neat graph – of equating printed words with a larger picture which they do not represent. If many of the appearances of ‘mariners’ (and indeed ‘seamen’) come from government publications, as I suspect they do, then that does not necessarily get us very close to what was said ‘on the street’. Perhaps specific printed sources might be better for this – ballads spring to mind, but Mark, you know more about them than I do – so that is something to take into account. It would be interesting if you could represent these differences in graphs or other visualisations, too – although that brings us nicely into the pitfalls of arbitrary categorisation… On the other hand, I think as long as the limits of the conclusions are clearly stated, there is no reason why these shouldn’t be added to the historians’ toolbox. A workshop on these (and similar?) methodologies would be a great idea.

    I wonder about the larger implications of these sorts of tools, too. As you say, Brodie, they can incorporate less ‘canonical’ texts, but they also force us to confront just how large the corpus of printed sources is (and as you also say, this doesn’t even get started on manuscripts!). Does this erode the old idea – never fully articulated but, I think, implicit in my graduate training – that a historian should read everything relevant to the topic and then distil some conclusions from it, simply because it becomes clear that ‘everything relevant’ could well be too much for one historian? Will it contribute to the much-commented-upon narrowing of research topics, or on the other hand force more collaborative projects? Or will it become accepted practice to use computer search results as a form of evidence? Things seem to be moving that way, although I don’t think it is very widely accepted yet. Perhaps it’s not such a great departure as it might appear, because historians have always done sampling of one kind or another; but maybe new kinds of ‘sample’, based on electronic research tools, will become more common. In this – as in much else – I am sure there are people who have come to these ideas long before me, and thought about them much harder too.

    • Yes, it’s obviously not a new issue. People like Tim Hitchcock et al have been thinking/talking about this for a long time now. But I think it is only now going ‘mainstream’ in that it’s moving out of the Digital Humanities literature and into conventional historiography.

      If you’re willing to organise an event, I’ll be there!

      • I don’t think I am knowledgeable enough to lead a workshop on this – one blog post does not an expert make – but I would definitely attend, and I’d be happy to help put it on (hint, hint)…

