English 455: Corpus Linguistics

Homework and Links

Brief Homework Assignments:

due Wednesday, Aug 26: HW1. Calculate (guesstimate) how many words a person experiences (hears/reads/produces) over the course of thirty years. What are the major variables that you need? When you are done, take a look at Michael Stubbs's guesstimation in section 6.2 of his essay Collocations and Semantic Profiles: On the Cause of the Trouble With Quantitative Studies. You will also find there references to other, similar estimations. Email me your results on Wednesday.
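One way to organize the guesstimate is to treat each activity as a rate (words per minute) multiplied by an exposure time. The sketch below only shows the arithmetic; every rate and every number of hours in it is a hypothetical placeholder rather than a measurement, and the function name is made up for illustration. Substitute your own variables and values.

```python
def words_experienced(words_per_minute, hours_per_day, days_per_year, years):
    """Total words for one activity: rate multiplied by total minutes of exposure."""
    total_minutes = hours_per_day * 60 * days_per_year * years
    return words_per_minute * total_minutes

# Hypothetical placeholder values (rate in words/minute, hours per day).
activities = {
    "listening": (150, 3.0),
    "reading": (200, 1.0),
    "speaking and writing": (120, 1.0),
}

total = 0
for name, (rate, hours) in activities.items():
    subtotal = words_experienced(rate, hours, 365, 30)
    total += subtotal
    print(f"{name}: {subtotal:,.0f} words")

print(f"total over thirty years: {total:,.0f} words")
```

Changing a single assumption (say, hours of listening per day) shows quickly which variables dominate the estimate.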

due Monday, Aug 31: HW2. Go to some of the electronic text sites we visited and download a text or two. Make a directory on your working computer in which to collect your texts. Start thinking about gathering a "test text" and texts for your semester project. What sorts of texts did you find collected? What problems did you have downloading specific texts? Were you able to read them after they had been downloaded? Also search around the web for additional texts that might interest you, such as those of a particular author, genre, language, or historical period. Most corpus research will be genre-related in some way, either looking for genre-specific characteristics or comparing characteristics across genres. Write a paragraph answering these questions and email it to me before class on Monday.

Wed, Sept. 2: HW3. Either download WordSmith Tools, or use the perl text tools available from our Moodle page. Also get the US Presidents Inaugural Addresses (uspis10.txt) and W's inaugural (inaug-w.txt) and/or W's second inaugural (inaug2-w.txt) and/or Obama's first inaugural address. Compare W's or Obama's inaugural address with one or more of his predecessors'. Can you find words or constructions that reflect the speakers' political agenda? Alternatively, what can you learn about the sub-genre "inaugural address"? Some items that are usually interesting are: personal pronouns, patriotic terms like "freedom," "democracy," or "children," and politically-inflected words like "diversity," "wealthy," or "global." Write up two things as notes: any interesting observations about the texts, and any functions you'd like to be able to do with the program but can't.
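Because the inaugural addresses differ in length, raw counts are hard to compare; a rate per thousand words is easier to read. The following is a minimal sketch of that normalization, not a substitute for WordSmith or the perl tools. The tokenizer is deliberately crude, the function names are invented for illustration, and the two sample strings are made-up stand-ins for the real files (which you would read with `open("inaug-w.txt").read()` and so on).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic words (with an optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def per_thousand(text, targets):
    """Return the frequency of each target word per 1,000 tokens of text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in targets}
    return {word: 1000 * counts[word] / total for word in targets}

# Made-up sample sentences standing in for two speeches.
speech_a = "We will defend freedom. We will defend our children."
speech_b = "I ask you to join me. Freedom is our common cause."

targets = ["we", "i", "freedom", "children"]
rates_a = per_thousand(speech_a, targets)
rates_b = per_thousand(speech_b, targets)
for word in targets:
    print(f"{word:10s} {rates_a[word]:8.1f} {rates_b[word]:8.1f}")
```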

Mon, Sept. 14: HW4. Do exercise 2 from Chapter 1 of Stubbs. You may want to use the BNC sampler or Collins sampler at the links below. Write a summary of your findings and any data you would like to show us in a text file so it can be displayed on the projector in class. Email it to me or upload it to Moodle.

Wed, Sept. 16: HW5. Following Stubbs's examples in section 2.2, discover a lemma whose different word-forms have very different collocations. Hint: think of frequently recurring idioms or phrases that usually take a specific grammatical form, like his example "time-consuming." Try to evaluate if the word could be said to have different meanings in its different forms. Email it to me or upload it to Moodle.
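If you are working from plain text rather than an online interface, the comparison can be sketched as counting the words that fall within a fixed span of each word-form. This is a simplified illustration under stated assumptions: the span of four words, the function name, and the sample sentence are all hypothetical, and a real study would use a much larger corpus and filter out function words.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Made-up sample text; replace with the tokens of your own corpus.
sample = ("the task was time consuming and fuel consuming but "
          "people consume food and consume energy every day").split()

for form in ("consume", "consuming"):
    print(form, collocates(sample, form, span=2).most_common(5))
```

Running the same function once per word-form of the lemma gives lists that can be compared side by side.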

Mon., Sept. 23: HW6. Select one from exercises 1-5 at the end of Chapter 4 of Stubbs and do it using the ANC-NYTimes corpus as your data. Write up the results briefly so we can discuss them on Monday and you can turn them in. Take the opportunity to look at his suggestions for notations, such as how to write our collocate groups (pp. 63-4). Please also notice how italics are normally used in linguistics: you italicize a word used as a word, and we are using all-caps to indicate a lemma: There were 25 instances of underwent out of 108 for UNDERGO after duplicates were removed.
If you want to refer to a specific instance or example from the data, use quotation marks, and if citation is necessary, use the ANC file name:

All the instances of underwent refer specifically to medical procedures, except for a case in which the Tour de France "standings underwent significant revisions" (nyt20020713.0155) and an interesting passage in which the hero of J.M. Coetzee's autobiographical novel Youth is described as having "underwent London" "in the name of experience" (nyt20020701.0183), the latter looking in the corpus context like a clever metaphor.

Midterm Project: Do a lexical profile. First, get a definition from at least two dictionaries (in addition to the OED you can use Webster's online). Next, at the VIEW site you can run concordance searches in each genre sub-corpus (in case there are any differences) and also collocation analysis (don't forget to record which statistical measure you used). If your word has many tokens, you can save the concordance output to a text file, then analyze it further with WordSmith. Remember that the BNC is British English, so you'll want to examine some American English as well. You can get concordance data from Collins, WebCorp, MICASE, and our ANC-NYTimes corpus. You can use our NYTimes corpus as your basic corpus if you wish. Look for standard grammatical contexts in addition to collocates. Finally, analyzing your data will lead you to specific comparisons: examine possible counterexamples, compare thesaurus entries (words that are supposedly near synonyms), or look at particular uses that demonstrate how speakers can employ a lexical prosody covertly. See Stubbs beginning on page 87 for a model using the word undergo. Some interesting possible words include: foment, amid, desist (compare the different forms of the lemma), sunken, wee, happen, sheer.
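Recording the statistical measure matters because different measures rank collocates differently. As an illustration, here is one common formulation of the mutual information (MI) score, which compares the observed frequency of a node-collocate pair with the frequency expected if the two words were independent. This is a sketch only: it ignores span size, the function name and the counts in the example are hypothetical, and the online interfaces may compute their scores differently.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI = log2(observed / expected), with expected = f(node) * f(collocate) / N."""
    if min(pair_freq, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts: the pair occurs 10 times, the node 100 times,
# the collocate 50 times, in a corpus of 1,000,000 tokens.
score = mutual_information(10, 100, 50, 1_000_000)
print(f"MI = {score:.2f}")
```

MI tends to reward rare pairs, so it is usually read alongside the raw pair frequency.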

Mon., Oct. 26: HW7. Create yourself a home page in your UCS account. The only catch is that you must work with raw code. If you'd like a head start, go to the practice page. Save it to your own computer either by using File -> Save As in your browser or by viewing the page source, selecting all the text, using Control-C to copy it to your clipboard, and pasting it into Notepad or your favorite substitute (JEdit is written in Java, so it runs on any platform, and it has lots of useful plugins; Metapad is the best free editor for Windows). Once you have the text in your editor, you can put your own name, links, etc. into the page and save it with a new name. When you are ready, FTP it back to your UCS account and save it in your public_html directory with the name index.html. Finally, from your account prompt, use the command chmod 644 index.html to give the world permission to visit your page (some FTP programs provide commands that allow you to change permissions). That's it! If you need help with your public_html directory, UCS provides some brief instructions for you.
If you are brave and have some extra time, you can use the editor in your UCS account to create a page either from scratch or from the practice page. Copy the files practice.html and vicheat from the en4551/tools directory to your public_html directory. At your prompt, type vi practice.html. The basic idea of vi is that you hit i when you want to type characters and ESC when you are done typing and want to move around, delete characters, or exit the program. View the vicheat file for basic directions, or type man vi to view the manual. Using vi is not for the faint of heart. If you have lots of time on your hands you can try learning emacs, which is friendlier once you get the hang of it. But using Metapad plus FTP is sufficient for this course.
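For reference, the last two steps above (saving a file named index.html and running chmod 644 on it) can be sketched in a few lines. This is an illustration under assumptions, not part of the assignment: the page template and the function name are hypothetical, and the demonstration writes into a throwaway temporary directory that merely stands in for public_html.

```python
import os
import stat
import tempfile
from pathlib import Path

# Hypothetical minimal page; replace the title and body with your own.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head><title>{title}</title></head>
<body>
<h1>{title}</h1>
<p>{body}</p>
</body>
</html>
"""

def write_home_page(directory, title, body):
    """Write index.html into `directory` and make it world-readable."""
    page = Path(directory) / "index.html"
    page.write_text(PAGE_TEMPLATE.format(title=title, body=body), encoding="utf-8")
    # Same effect as `chmod 644 index.html`: owner read/write, everyone else read-only.
    os.chmod(page, 0o644)
    return page

# Demonstration in a temporary directory standing in for public_html.
with tempfile.TemporaryDirectory() as demo_dir:
    page = write_home_page(demo_dir, "My Home Page", "Hello from English 455.")
    mode = stat.S_IMODE(page.stat().st_mode)
    print(page.name, oct(mode))
```

The mode 644 breaks down as 6 (read and write) for the owner and 4 (read only) for the group and for everyone else, which is what lets a web server show the page to visitors.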

Date unknown: Compare the outputs from at least two different POS taggers. First, find or write a short paragraph (20-30 words with punctuation; I suggest taking a paragraph from the text you've been focusing on). You have access to three different taggers. The easiest is the UCREL CLAWS4 WWW Trial Tagging Service, where you simply paste your text into the space provided, then copy and paste the results into a document. The second is Oliver Mason's QTAG, which you can access from your UCS account. You can copy the qtag directory from en4551/tools to your own en4551 directory, or you can easily use it where it is by defining a variable: type set a=/w1/en4551/tools/q-tag, hit enter, then type echo $a and enter to check your variable. Then to use the tagger from anywhere in your account, simply type

java -jar $a/qtag.jar $a/qtag-eng inputfilename > outputfilename

where inputfilename = the file you want to tag. Make sure it is plain text in unix format! Accurate tagging usually requires tokenizing. If you use Mason's tokenizer the tagger will produce different results. To use the tokenizer, type java -jar $a/qtoken.jar inputfilename. The tokenizer will create a file called inputfilename.tok in the working directory. You can use this file as input to the tagger.
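"Unix format" here means that each line ends with a single line-feed character, whereas files saved on Windows typically end lines with a carriage return plus a line feed. If your file was created on another system, a conversion along the following lines is one option. It is a sketch only: the function names are invented for illustration, and the commented-out call uses a placeholder file name.

```python
def to_unix_newlines(data):
    """Convert Windows (CRLF) and old Mac (CR) line endings in bytes to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def convert_file(in_path, out_path):
    """Read a file as bytes, normalize its line endings, and write the result."""
    with open(in_path, "rb") as source:
        data = source.read()
    with open(out_path, "wb") as target:
        target.write(to_unix_newlines(data))

# Example on an in-memory sample rather than a real file.
print(to_unix_newlines(b"first line\r\nsecond line\r\n"))

# convert_file("inputfilename", "inputfilename.unix")  # placeholder names
```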
The third tagger is Brill's tagger. Here is one online version. There is another online version. In fact, several more demonstration taggers are listed at the Syntax Resources site. Or download the Windows executable, and run it according to the instructions on the WinBrill page. There is also a more primitive version that you can download, unzip, and run by following the directions in the file Manual.txt.
Once you have your tagged texts, compare them side by side (a spreadsheet program might be good for this) and see which is the most accurate. Of course, each program uses a different tagset, so this will complicate your job! Do the best you can and we'll review your results on Friday.
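As an alternative to a spreadsheet, the side-by-side comparison can be sketched in a short script. The sketch assumes that each tagger's output has been reduced to whitespace-separated word-plus-tag items and that both taggers tokenized the text identically; the separators, the function names, and the sample tags are illustrative assumptions, and the real output formats of CLAWS, QTAG, and Brill's tagger may need extra cleanup first.

```python
def parse_tagged(text, sep="_"):
    """Split items such as 'dog_NN1' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        if sep in item:
            # rpartition splits on the last separator, so './.' gives ('.', '.').
            word, _, tag = item.rpartition(sep)
        else:
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs

def side_by_side(first, second):
    """Pair up the tags two taggers gave to the same sequence of words."""
    if len(first) != len(second):
        raise ValueError("the two outputs have different numbers of tokens")
    rows = []
    for (word_a, tag_a), (word_b, tag_b) in zip(first, second):
        if word_a != word_b:
            raise ValueError(f"token mismatch: {word_a!r} vs {word_b!r}")
        rows.append((word_a, tag_a, tag_b))
    return rows

# Illustrative sample lines in two assumed formats.
output_a = "The_AT0 dog_NN1 barks_VVZ ._PUN"
output_b = "The/DT dog/NN barks/VBZ ./."

for word, tag_a, tag_b in side_by_side(parse_tagged(output_a),
                                       parse_tagged(output_b, sep="/")):
    print(f"{word:10s} {tag_a:6s} {tag_b:6s}")
```

Because the tagsets differ, matching rows still have to be judged by eye (or with a mapping table you build yourself) to decide whether two different labels mean the same thing.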

Date unknown: Visit the Oxford English Dictionary online (you must be on campus or connected to the web by the STEP-UP dialup connection, unless you have your own subscription to the site). Look up the word download. Compare the definitions for the noun and the verb, and their earliest attestations (first known appearance in print). Can you infer anything about the relation between written and spoken language from this information? What does the OED say about upload? The words money, toilet, and port (a kind of wine) are often given as examples of metonymy acting in a word's history. Given the OED entries of these words, what is metonymy?

Due November 11: Electronic Project Review. Identify a humanities archival or editorial project that interests you. You might select one that you hope to use in your teaching, one that you have used or will use in your research, or one that you might want to emulate in your own project. Make a thorough tour of the site. Write a review of the site, with an online review repository (like H-Net Reviews) or an online journal (like Postmodern Culture or Electronic Book Review) as the target publication venue. Some questions you might ask include: How available is the site? How easy is it to use? What are the site's goals, and does it meet them? Did you try it in different browsers? How about aesthetics? For what users would you recommend it? As a conceptual outline you might consider Unsworth's (2000) characteristics of thematic research collections as described in chapter 24 of Companion.

Various Link Collections:
David Lee's Corpus-based Linguistics Links is fairly up-to-date
Michael Barlow's Corpus Linguistics Page (dated)
Susan Hockey's course bibliography and links (dated)
Gateway to Corpus Linguistics on the Internet (dated)

Useful Online Corpus Interfaces and Tools:
Collins WordbanksOnline English corpus sampler--does concordances and collocations, but output is limited to 40 lines
BNC Online Service
Register (free) to use Lancaster's BNCweb service to utilize all of the benefits of the XML version of BNC
corpus.byu.edu--Mark Davies's site now includes two useful American English corpora as well as a useful BNC interface
William Fletcher's Phrases in English--A BNC interface that focuses on phrases
Fletcher is also developing a web concordancer
Mike Scott's Web (author of WordSmith Tools)
Michigan Corpus of Academic Spoken English (MICASE) Corpus Search
Michigan Corpus Linguistics Home -- Resources for students, teachers, and researchers
WebCorp Concordancer
International Computer Archive of Modern and Medieval English -- with links to the ICAME journal and corpus CD
UCREL CLAWS4 WWW Trial Tagging Service (take note of word limits)
Stanford NLP online parser demo (single sentences only)
Download page for Oliver Mason's java-based QTAG tagger
TEI-Lite Pizza Chef for creating DTDs
Latent Semantic Analysis Homepage
Turbo Lingo--performs basic text stats on a web page or pasted text
Shakespeare Search Engine--a nice search interface
Metapad -- excellent, free text editor
JEdit is a free text editor that makes a good xml editor
Wordnet -- from the homepage you can open the Wordnet browser or read about the database
FrameNet -- frame semantics for the web
Charles O. Hartman does some interesting things with computers and poetry
Developing Linguistic Corpora: a Guide to Good Practice
Vocabulary Management Profiles--Web implementation of Gilbert Youmans's software discussed by Stubbs in Chap. 6

Archives and Projects:
Dickinson Electronic Archives
Cynthia Hallen's Emily Dickinson Lexicon
The Walt Whitman Archive
Corpus of Middle English Prose and Verse
The Proceedings of the Old Bailey, London 1674 to 1834--fascinating!
(reviewed by historian John Smail at H-Net.org)
American Memory from the Library of Congress
The Edgar Allan Poe Digital Collection
Michael Taft's Blues Lyrics Concordance
Rotunda--UVA Press's electronic imprint
University of Sheffield's Humanities Research Institute Online--has several projects complete and ongoing
McGann's Rossetti Archive--groundbreaking, especially as incorporating images
British Library's Treasures in Full--includes Shakespeare's Quartos and Caxton's Chaucer with side-by-side comparison views
List of projects at the TEI-C
Electronic Literature Directory

Journals:
Literary and Linguistic Computing
ICAME Journal
Corpus Linguistics and Linguistic Theory
Journal of English Linguistics
International Journal of Corpus Linguistics
Journal of Quantitative Linguistics
Corpora

Digital Humanities:
ADHO's Essays in Digital Humanities
Digital Humanities Quarterly -- a new open-access peer-reviewed electronic journal from ADHO
A Companion to Digital Literary Studies (fulltext online)
Digital Humanities 2009 -- See what was going on at the most recent DH conference
TEI Consortium (TEI): http://www.tei-c.org
Association for Computers and the Humanities (ACH): http://www.ach.org
Maryland Institute for Technology in the Humanities: http://www.mith2.umd.edu/
Matt Kirschenbaum's blog
Nick Montfort's ppg256 series -- poetry generators written in 256 characters of perl

Text Repositories:
Oxford Text Archive
Project Gutenberg
Bartleby.com
Internet Archive
Google Books
JSTOR (access restricted)

Basic Unix Commands:
Regular Expressions for Humanists, by Stephen Ramsay
from Stanford--note: if you print using the "lp" commands you will have to walk over to UCS to pick up your job.
from Vermont--note: we don't have the "pine" email program or the "pico" editor. Also, you will not normally be using an X-windows display.
from Oregon--note: again, we do not have pico and I don't recommend printing.
from Oxford--notes: again, unless you go to the Unix lab on campus, you won't be using an X-windows terminal.
from Trinity University--includes some handy shortcuts that don't all work on our system.

General Linguistics:
The Linguist List -- Clearinghouse for linguistics information of all kinds
Peter Patrick's African American English Page -- Contains an especially comprehensive bibliography and a nice page of selected readings.
On-Line English Grammar -- a helpful resource for English grammatical terminology
Lexicon of Linguistics -- explains some linguistics terminology
Poetics and Linguistics Association -- Professional association for stylistics

Go to Clai's Home Page