Friday, Aug. 22: Go to some of the electronic text sites we visited and download a text or two. Make a directory on your working computer in which to collect your texts. Start thinking about gathering texts for your semester project. What kinds of texts did you find collected? What problems did you have downloading specific texts? Were you able to read them after they had been downloaded? Also search around the web for additional texts that might interest you, such as those of a particular author, genre, language, or historical period. Most corpus research will be genre-related in some way, either looking for genre-specific characteristics or comparing characteristics across genres.
Wed, Aug. 30: Download WordSmith Tools, and ftp from your class project folder the US Presidents Inaugural Addresses (uspis10.txt) and W's inaugural (inaug-w.txt) and/or W's second inaugural (inaug2-w.txt). Compare W's inaugural address with one or more of his predecessors'. Can you find words or constructions that reflect the speakers' political agenda? Alternatively, what can you learn about the sub-genre "inaugural address"? Some items that are usually interesting are: personal pronouns, patriotic terms like "freedom," "democracy," or "children," politically-inflected words like "diversity," "wealthy," or "global." Write up two things as notes: any interesting observations about the texts, and any functions you'd like to be able to do with the program but can't.
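If you would like to check your WordSmith counts by hand, the comparison can be sketched in a few lines of Python. This is only a rough sketch, not part of the assignment: the function names are made up for illustration, the tokenizer is deliberately crude (lowercase letters and apostrophes only), and it assumes each file holds a single address as plain text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def per_thousand(text, targets):
    """Return the frequency of each target word per 1,000 tokens of text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty text
    return {word: 1000 * counts[word] / total for word in targets}


def compare_files(path_a, path_b, targets):
    """Return {word: (rate in file A, rate in file B)} for two plain-text files."""
    rates = []
    for path in (path_a, path_b):
        # errors="replace" keeps odd characters from stopping the read
        with open(path, encoding="utf-8", errors="replace") as handle:
            rates.append(per_thousand(handle.read(), targets))
    return {word: (rates[0][word], rates[1][word]) for word in targets}
```

For example, compare_files("inaug-w.txt", "inaug2-w.txt", ["we", "i", "freedom", "democracy"]) would return the rate per 1,000 words for each item in both addresses. Rates are used rather than raw counts because the addresses differ in length. Note that uspis10.txt contains many addresses in one file, so you would need to cut out a single address before comparing it this way.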
Wed, Sept. 13: Do exercise 1 or 3 from Chapter 2 of Stubbs. You will probably want to use the BNC sampler or Collins sampler at the links below. Write a summary of your findings and any data you would like to show us in a text file so it can be displayed on the projector in class.
Mon., Sept. 25: Select one from exercises 1-5 at the end of Chapter 4 of Stubbs and do it using the ANC-NYTimes corpus as your data. Write up the results briefly so we can discuss them on Monday and you can turn them in. Take the opportunity to look at his suggestions for notations, such as how to write our collocate groups (p. 63-4). Please also notice how italics are normally used in linguistics: you italicize a word used as a word, and we are using all-caps to indicate a lemma:
There were 25 instances of underwent out of 108 for UNDERGO after duplicates were removed.
If you want to refer to a specific instance or example from the data, use quotation marks, and if citation is necessary, use the ANC file name.
Midterm Project: Do a lexical profile. First, get a definition from at least two dictionaries (in addition to the OED you can use Webster's online). Next, at the VIEW site you can run concordance searches in each genre sub-corpus (in case there are any differences) and also collocation analysis (don't forget to record which statistical measure you used). If your word has many tokens, you can save the concordance output to a text file, then analyze it further with WordSmith. Remember that the BNC is British English, so you'll want to examine some American English as well. You can get concordance data from Collins, WebCorp, MICASE, and our ANC-NYTimes corpus. You can use our NYTimes corpus as your basic corpus if you wish. Look for standard grammatical contexts in addition to collocates. Finally, analyzing your data will lead you to specific comparisons: examine possible counterexamples, compare thesaurus entries (words that are supposedly near synonyms), or look at particular uses that demonstrate how speakers can employ a lexical prosody covertly. See Stubbs beginning on page 87 for a model using the word undergo. Some interesting possible words include: foment, amid, desist (compare the different forms of the lemma), sunken, wee, happen, sheer.
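If you save concordance output or corpus text to a file, the idea behind a collocate count can be sketched as follows. This is a minimal illustration under several assumptions: the function name and the default span of 4 words on either side are my own choices, the window ignores sentence boundaries, and the result is raw co-occurrence frequency only, not one of the statistical measures that VIEW or WordSmith report.

```python
import re
from collections import Counter


def collocates(text, node_forms, span=4):
    """Count the words found within `span` tokens to the left or right of any node form."""
    tokens = re.findall(r"[a-z']+", text.lower())
    node_forms = {form.lower() for form in node_forms}
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in node_forms:
            left = tokens[max(0, i - span):i]      # up to `span` words before the node
            right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
            counts.update(left + right)
    return counts
```

Passing every form of a lemma as node_forms, for instance ["undergo", "undergoes", "underwent", "undergone", "undergoing"], gives you collocates for the lemma as a whole, and counts.most_common(20) lists the twenty most frequent. Function words will dominate a raw list like this, which is one reason to record and use a statistical measure in your write-up.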
Wednesday, Nov. 1: Create yourself a home page in your UCS account. The only catch is that you must work with raw code. If you'd like a head start, go to the practice page. Save it to your own computer either by using File -> Save As in your browser or by viewing the page source, selecting all the text, using Control-C to copy it to your clipboard, and pasting it into Notepad or your favorite substitute (Metapad is the best free one for Windows). Once you have the text in your editor, you can put your own name, links, etc. into the page, and save it with a new name. When you are ready, FTP it back to your UCS account and save it in your public_html directory with the name index.html. Finally, from your account prompt, use the command chmod 644 index.html to give the world permission to visit your page (some FTP programs provide commands that allow you to change permissions). That's it! If you need help with your public_html directory, UCS provides some brief instructions for you.
If you are brave and have some extra time, you can use the editor in your UCS account to create a page either from scratch or from the practice page. Copy the files practice.html and vicheat from the en4551/tools directory to your public_html directory. At your prompt, type vi practice.html. The basic idea of vi is that you hit i if you want to type characters and ESC when you are done typing and want to move around, delete characters, or exit the program. View the vicheat file for basic directions, or type man vi to view the manual. Using vi is not for the faint of heart. If you have lots of time on your hands you can try learning emacs, which is more friendly once you get the hang of it. But using Metapad plus FTP is sufficient for this course.
Date unknown: Compare the outputs from at least two different POS taggers. First, find or write a short paragraph (20-30 words with punctuation). You have access to three different taggers. The easiest is the UCREL CLAWS4 WWW Trial Tagging Service, where you simply paste your text into the space provided, then copy and paste the results into a document. The second is Oliver Mason's QTAG, which you can access from your UCS account. You can copy the qtag directory from en4551/tools to your own en4551 directory, or you can easily use it where it is by defining a variable: type set a=/w1/en4551/tools/q-tag, hit enter, then type echo $a and enter to check your variable. Then to use the tagger from anywhere in your account, simply type
java -jar $a/qtag.jar $a/qtag-eng inputfilename > outputfilename
where inputfilename is the file you want to tag and outputfilename is the file that will hold the tagged text. Make sure the input is plain text in unix format! Accurate tagging usually requires tokenizing. If you use Mason's tokenizer the tagger will produce different results. To use the tokenizer, type java -jar $a/qtoken.jar inputfilename. The tokenizer will create a file called inputfilename.tok in the working directory. You can use this file as input to the tagger.
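Once you have two tagged versions of your paragraph, lining them up by eye gets tedious. The sketch below shows one way to list the disagreements. It rests on an assumption you will need to check: that you have first edited both outputs into the same simple format, one word_TAG item per token separated by spaces. The taggers' real output formats differ from this and from each other, and the function names here are only illustrative.

```python
def parse_tagged(text, sep="_"):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        if sep in item:
            # split on the LAST separator so words containing it stay whole
            word, _, tag = item.rpartition(sep)
        else:
            word, tag = item, ""  # an item with no tag attached
        pairs.append((word, tag))
    return pairs


def compare_tags(tagged_a, tagged_b, sep="_"):
    """Return (word, tag_a, tag_b) for every position where the two tags differ."""
    a = parse_tagged(tagged_a, sep)
    b = parse_tagged(tagged_b, sep)
    if [word for word, _ in a] != [word for word, _ in b]:
        raise ValueError("The outputs are tokenized differently; align them first.")
    return [(word, tag_a, tag_b)
            for (word, tag_a), (_, tag_b) in zip(a, b)
            if tag_a != tag_b]
```

Keep in mind that the taggers use different tagsets, so many of the differences reported will be differences of labeling rather than of analysis. The list is a starting point for your own inspection, not a score.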
Date unknown?: Visit the Oxford English Dictionary online (you must be on campus or connected to the web by the STEP-UP dialup connection, unless you have your own subscription to the site). Look up the word download. Compare the definitions for the noun and the verb, and their earliest attestations (first known appearance in print). Can you infer anything about the relation between written and spoken language from this information? What does the OED say about upload? The words money, toilet, and port (a kind of wine) are often given as examples of metonymy acting in a word's history. Given the OED entries of these words, what is metonymy?
Due November 8: Electronic Project Review. Identify a humanities archival or editorial project that interests you. You might select one that you hope to use in your teaching, one that you have used or will use in your research, or one that you might want to emulate in your own project. Make a thorough tour of the site. Write a review of the site, with an online review repository (like H-Net Reviews) or an online journal (like Postmodern Culture or Electronic Book Review) as the target publication venue. Some questions you might ask include: How available is the site? How easy is it to use? What are the site's goals, and does it meet them? Did you try it in different browsers? How about aesthetics? For what users would you recommend it?
Various Link Collections:
Michael Barlow's Corpus Linguistics Page
Susan Hockey's course bibliography and links
Hermann Moisl's corpus linguistics links
Gateway to Corpus Linguistics on the Internet
Cathy Ball's excellent tutorial
Useful Online Corpus Interfaces and Tools:
Collins WordbanksOnline English corpus sampler--does concordances and collocations, but output is limited to 40 lines
BNC Online Service
VIEW--Mark Davies's BNC interface
William Fletcher's Phrases in English--A BNC interface that focuses on phrases
Mike Scott's Web (author of WordSmith Tools)
Michigan Corpus of Academic Spoken English Corpus Search
WebCorp Concordancer
UCREL CLAWS4 WWW Trial Tagging Service
Stanford NLP online parser demo
Instructions for Oliver Mason's QTAG tagger
TEI-Lite Pizza Chef for creating DTDs
Latent Semantic Analysis Homepage
Turbo Lingo--performs basic text stats on a web page or pasted text
Shakespeare Search Engine--a nice search interface
Metapad -- excellent, free text editor
An Introduction to XML and the TEI--tutorial from MITH using JEdit as XML editor
WordNet Search 2.1
FrameNet -- frame semantics for the web
Charles O. Hartman does some interesting things with computers and poetry
Developing Linguistic Corpora: a Guide to Good Practice
Archives and Projects:
Dickinson Electronic Archives
Cynthia Hallen's Emily Dickinson Lexicon
The Walt Whitman Archive
Corpus of Middle English Prose and Verse
The Proceedings of the Old Bailey, London 1674 to 1834--fascinating! (reviewed by historian John Smail at H-Net.org)
American Memory from the Library of Congress
Michael Taft's Blues Lyrics Concordance
Rotunda--UVA Press's electronic imprint
University of Sheffield's Humanities Research Institute Online--has several projects complete and ongoing
McGann's Rossetti Archive--groundbreaking, especially as incorporating images
British Library's Treasures in Full--includes Shakespeare's Quartos and Caxton's Chaucer with side-by-side comparison views
List of projects at the TEI-C
Electronic Literature Directory
Digital Humanities:
ADHO's Essays in Digital Humanities
TEI Consortium (TEI): http://www.tei-c.org
Association for Computers and the Humanities (ACH): http://www.ach.org
Maryland Institute for Technology in the Humanities: http://www.mith2.umd.edu/
Text Repositories:
Oxford Text Archives
Project Gutenberg
Bartleby.com
General Linguistics:
The Linguist List -- Clearinghouse for linguistics information of all kinds
Peter Patrick's African American English Page -- Contains an especially comprehensive bibliography and a nice page of selected readings.
On-Line English Grammar -- a helpful resource for English grammatical terminology
Lexicon of Linguistics -- explains some linguistics terminology
Poetics and Linguistics Association -- Professional association for stylistics
Go to Clai's Home Page