Due Monday, Aug 31: HW2. Go to some of the electronic text sites we visited and download a text or two. Make a directory on your working computer in which to collect your texts. Start thinking about gathering a "test text" and texts for your semester project. What sorts of texts did you find collected? What problems did you have downloading specific texts? Were you able to read them after they had been downloaded? Also search around the web for additional texts that might interest you, such as those of a particular author, genre, language, or historical period. Most corpus research will be genre-related in some way, either looking for genre-specific characteristics or comparing characteristics across genres. Write a paragraph answering these questions and email it to me before class on Monday.
Wed, Sept. 2: HW3. Either download WordSmith Tools or use the perl text tools available from our Moodle page. Also get the US Presidents Inaugural Addresses (uspis10.txt) and W's inaugural (inaug-w.txt) and/or W's second inaugural (inaug2-w.txt) and/or Obama's first inaugural address. Compare W's or Obama's inaugural address with one or more of his predecessors'. Can you find words or constructions that reflect the speakers' political agenda? Alternatively, what can you learn about the sub-genre "inaugural address"? Some items that are usually interesting are personal pronouns; patriotic terms like "freedom," "democracy," or "children"; and politically inflected words like "diversity," "wealthy," or "global." Write up two things as notes: any interesting observations about the texts, and any functions you'd like to be able to do with the program but can't.
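If you would rather script the comparison than use WordSmith or the perl tools, here is a minimal sketch in Python (my choice of language for illustration, not one of the course tools). It reports how often each word on a list occurs per 1,000 tokens, so that a short address and a long one can be compared fairly. The function names and the very simple tokenizer are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def per_thousand(text, targets):
    """Return each target word's frequency per 1,000 tokens of text.

    Hypothetical helper for illustration; it does no lemmatizing,
    so "child" and "children" are counted separately.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty file
    return {word: 1000 * counts[word] / total for word in targets}

def compare_files(path_a, path_b, targets):
    """Print the per-1,000 rates for two plain-text files side by side."""
    with open(path_a, encoding="utf-8", errors="replace") as f:
        rates_a = per_thousand(f.read(), targets)
    with open(path_b, encoding="utf-8", errors="replace") as f:
        rates_b = per_thousand(f.read(), targets)
    for word in targets:
        print("%-12s %8.2f %8.2f" % (word, rates_a[word], rates_b[word]))
```

For example, compare_files("inaug-w.txt", "inaug2-w.txt", ["we", "freedom", "children"]) would print one row per word, assuming both files are in your working directory.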
Mon, Sept. 14: HW4. Do exercise 2 from Chapter 1 of Stubbs. You may want to use the BNC sampler or Collins sampler at the links below. Write a summary of your findings and any data you would like to show us in a text file so it can be displayed on the projector in class. Email it to me or upload it to Moodle.
Wed, Sept. 16: HW5. Following Stubbs's examples in section 2.2, discover a lemma whose different word-forms have very different collocations. Hint: think of frequently recurring idioms or phrases that usually take a specific grammatical form, like his example "time-consuming." Try to evaluate whether the word could be said to have different meanings in its different forms. Email your findings to me or upload them to Moodle.
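The idea of comparing collocates across the word-forms of one lemma can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is made up, the tokenizer is crude, and the default span of four words on either side is a setting you should adjust and report, not a fixed rule.

```python
import re
from collections import Counter, defaultdict

def collocates_by_form(text, forms, span=4):
    """Count the words found within `span` tokens of each word-form.

    Returns a dict mapping each form to a Counter of its collocates.
    Hypothetical helper: raw counts only, no statistical measure.
    """
    # Keep hyphenated and apostrophized words together as single tokens.
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    wanted = set(forms)
    table = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok in wanted:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            table[tok].update(left + right)
    return table
```

Calling collocates_by_form(text, ["consume", "consumed", "consuming"]) and then .most_common(10) on each entry gives a quick first look at whether the forms keep different company.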
Mon., Sept. 23: HW6. Select one from exercises 1-5 at the end of Chapter 4 of Stubbs and do it using the ANC-NYTimes corpus as your data. Write up the results briefly so we can discuss them on Monday and you can turn them in. Take the opportunity to look at his suggestions for notations, such as how to write our collocate groups (pp. 63-4). Please also notice how italics are normally used in linguistics: you italicize a word used as a word, and we are using all-caps to indicate a lemma:
There were 25 instances of underwent out of 108 for UNDERGO after duplicates were removed.
If you want to refer to a specific instance or example from the data, use quotation marks, and if citation is necessary, use the ANC file name:
Midterm Project: Do a lexical profile. First, get a definition from at least two dictionaries (in addition to the OED you can use Webster's online). Next, at the VIEW site you can run concordance searches in each genre sub-corpus (in case there are any differences) and also collocation analysis (don't forget to record which statistical measure you used). If your word has many tokens, you can save the concordance output to a text file, then analyze it further with WordSmith. Remember that the BNC is British English, so you'll want to examine some American English as well. You can get concordance data from Collins, WebCorp, MICASE, and our ANC-NYTimes corpus. You can use our NYTimes corpus as your basic corpus if you wish. Look for standard grammatical contexts in addition to collocates. Finally, analyzing your data will lead you to specific comparisons: examine possible counterexamples, compare thesaurus entries (words that are supposedly near synonyms), or look at particular uses that demonstrate how speakers can employ a lexical prosody covertly. See Stubbs beginning on page 87 for a model using the word undergo. Some interesting possible words include: foment, amid, desist (compare the different forms of the lemma), sunken, wee, happen, sheer.
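If you have a plain-text corpus on your own machine, a rough keyword-in-context listing can also be produced with a short Python function. This is a sketch, not a replacement for the concordancers named above: the name kwic and the width parameter are illustrative, and it matches one exact word-form at a time, so run it once per form of your lemma.

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for every match of `node` in `text`.

    Hypothetical helper: `width` is the number of characters of context
    kept on each side, and matching ignores case.
    """
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b%s\b" % re.escape(node), re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-justify the left context so the node words line up in a column.
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(0), right))
    return lines
```

Printing the result of kwic(text, "undergo") one line at a time gives a listing you can sort or save to a text file for further analysis.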
Mon., Oct. 26: HW7. Create yourself a home page in your UCS account. The only catch is that you must work with raw code. If you'd like a head start, go to the practice page. Save it to your own computer either by using File -> Save As in your browser or by viewing the page source, then selecting all the text, then using Control-C to copy it to your clipboard, then pasting it into Notepad or your favorite substitute (JEdit is written in Java, so it can run on any platform; it also has lots of useful plugins. Metapad is the best free one for Windows). Once you have the text in your editor, you can put your own name, links, etc. into the page, and save it with a new name. When you are ready, FTP it back to your UCS account and save it in your public_html directory with the name index.html. Finally, from your account prompt, use the command chmod 644 index.html to give the world permission to visit your page (some FTP programs provide commands that allow you to change permissions). That's it! If you need help with your public_html directory, UCS provides some brief instructions for you. If you are brave and have some extra time, you can use the editor in your UCS account to create a page either from scratch or from the practice page. Copy the files practice.html and vicheat from the en4551/tools directory to your public_html directory. At your prompt, type vi practice.html. The basic idea of vi is that you hit i if you want to type characters and ESC when you are done typing and want to move around, delete characters, or exit the program. View the vicheat for basic directions, or type man vi to view the manual. Using vi is not for the faint of heart. If you have lots of time on your hands you can try learning emacs, which is friendlier once you get the hang of it. But using Metapad plus FTP is sufficient for this course.
Date unknown: Compare the outputs from at least two different POS taggers. First, find or write a short paragraph (20-30 words with punctuation; I suggest taking a paragraph from the text you've been focusing on). You have access to three different taggers. The easiest is the UCREL CLAWS4 WWW Trial Tagging Service, where you simply paste your text into the space provided, then copy and paste the results into a document. The second is Oliver Mason's QTAG, which you can access from your UCS account. You can copy the qtag directory from en4551/tools to your own en4551 directory, or you can easily use it where it is by defining a variable: type set a=/w1/en4551/tools/q-tag, hit enter, then type echo $a and enter to check your variable. Then to use the tagger from anywhere in your account, simply type
java -jar $a/qtag.jar $a/qtag-eng inputfilename > outputfilename
where inputfilename = the file you want to tag and outputfilename = the file that will receive the tagged text. Make sure the input is plain text in unix format! Accurate tagging usually requires tokenizing. If you use Mason's tokenizer the tagger will produce different results. To use the tokenizer, type java -jar $a/qtoken.jar inputfilename. The tokenizer will create a file called inputfilename.tok in the working directory. You can use this file as input to the tagger.
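Once you have output from two taggers, lining the tags up by hand gets tedious. The Python sketch below shows one way to do it, assuming you have first converted both outputs into a single line of word_TAG items (the taggers' native output formats may differ, so that conversion step is yours to do). The function names and the tags in the example are illustrative, not taken from either tagger's documentation. Keep in mind that the taggers use different tagsets, so a different label is not necessarily a disagreement.

```python
def parse_tagged(line, sep="_"):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs.

    Assumes each item carries its tag after the last separator;
    an item with no separator gets an empty tag.
    """
    pairs = []
    for item in line.split():
        if sep in item:
            word, _, tag = item.rpartition(sep)
        else:
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs

def side_by_side(tagged_a, tagged_b):
    """Pair up two taggers' output token by token as (word, tag_a, tag_b).

    Raises ValueError if the two outputs were tokenized differently,
    since the tags could not then be aligned one-to-one.
    """
    a, b = parse_tagged(tagged_a), parse_tagged(tagged_b)
    if [w for w, _ in a] != [w for w, _ in b]:
        raise ValueError("the taggers tokenized the text differently")
    return [(w, ta, tb) for (w, ta), (_, tb) in zip(a, b)]
```

Printing each triple on its own line gives a three-column table you can paste into your write-up; the ValueError is a reminder that tokenization differences need to be resolved before tags can be compared.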
Date unknown: Visit the Oxford English Dictionary online (you must be on campus or connected to the web by the STEP-UP dialup connection, unless you have your own subscription to the site). Look up the word download. Compare the definitions for the noun and the verb, and their earliest attestations (first known appearance in print). Can you infer anything about the relation between written and spoken language from this information? What does the OED say about upload? The words money, toilet, and port (a kind of wine) are often given as examples of metonymy acting in a word's history. Given the OED entries of these words, what is metonymy?
Due November 11: Electronic Project Review. Identify a humanities archival or editorial project that interests you. You might select one that you hope to use in your teaching, one that you have used or will use in your research, or one that you might want to emulate in your own project. Make a thorough tour of the site. Write a review of the site, with an online review repository (like H-Net Reviews) or an online journal (like Postmodern Culture or Electronic Book Review) as the target publication venue. Some questions you might ask include: How available is the site? How easy is it to use? What are the site's goals, and does it meet them? Did you try it in different browsers? How about aesthetics? For what users would you recommend it? As a conceptual outline you might consider Unsworth's (2000) characteristics of thematic research collections as described in chapter 24 of Companion.
Various Link Collections:
David Lee's Corpus-based Linguistics Links (fairly up-to-date)
Michael Barlow's Corpus Linguistics Page (dated)
Susan Hockey's course bibliography and links (dated)
Gateway to Corpus Linguistics on the Internet (dated)
Useful Online Corpus Interfaces and Tools:
Collins WordbanksOnline English corpus sampler--does concordances and collocations, but output is limited to 40 lines
BNC Online Service
Register (free) to use Lancaster's BNCweb service to utilize all of the benefits of the XML version of BNC
corpus.byu.edu--Mark Davies's site now includes two useful American English corpora as well as a useful BNC interface
William Fletcher's Phrases in English--A BNC interface that focuses on phrases
Fletcher is also developing a web concordancer
Mike Scott's Web (author of WordSmith Tools)
Michigan Corpus of Academic Spoken English (MICASE) Corpus Search
Michigan Corpus Linguistics Home -- Resources for students, teachers, and researchers
WebCorp Concordancer
International Computer Archive of Modern and Medieval English -- with links to the ICAME journal and corpus CD
UCREL CLAWS4 WWW Trial Tagging Service (take note of word limits)
Stanford NLP online parser demo (single sentences only)
Download page for Oliver Mason's java-based QTAG tagger
TEI-Lite Pizza Chef for creating DTDs
Latent Semantic Analysis Homepage
Turbo Lingo--performs basic text stats on a web page or pasted text
Shakespeare Search Engine--a nice search interface
Metapad -- excellent, free text editor
JEdit is a free text editor that makes a good xml editor
Wordnet -- from the homepage you can open the Wordnet browser or read about the database
FrameNet -- frame semantics for the web
Charles O. Hartman does some interesting things with computers and poetry
Developing Linguistic Corpora: a Guide to Good Practice
Vocabulary Management Profiles--Web implementation of Gilbert Youmans's software discussed by Stubbs in Chap. 6
Archives and Projects:
Dickinson Electronic Archives
Cynthia Hallen's Emily Dickinson Lexicon
The Walt Whitman Archive
Corpus of Middle English Prose and Verse
The Proceedings of the Old Bailey, London 1674 to 1834--fascinating! (reviewed by historian John Smail at H-Net.org)
American Memory from the Library of Congress
The Edgar Allan Poe Digital Collection
Michael Taft's Blues Lyrics Concordance
Rotunda--UVA Press's electronic imprint
University of Sheffield's Humanities Research Institute Online--has several projects complete and ongoing
McGann's Rossetti Archive--groundbreaking, especially as incorporating images
British Library's Treasures in Full--includes Shakespeare's Quartos and Caxton's Chaucer with side-by-side comparison views
List of projects at the TEI-C
Electronic Literature Directory
Journals:
Literary and Linguistic Computing
ICAME Journal
Corpus Linguistics and Linguistic Theory
Journal of English Linguistics
International Journal of Corpus Linguistics
Journal of Quantitative Linguistics
Corpora
Digital Humanities:
ADHO's Essays in Digital Humanities
Digital Humanities Quarterly -- a new open-access peer-reviewed electronic journal from ADHO
A Companion to Digital Literary Studies (fulltext online)
Digital Humanities 2009 -- See what was going on at the most recent DH conference
TEI Consortium (TEI): http://www.tei-c.org
Association for Computers and the Humanities (ACH): http://www.ach.org
Maryland Institute for Technology in the Humanities: http://www.mith2.umd.edu/
Matt Kirschenbaum's blog
Nick Montfort's ppg256 series -- poetry generators written in 256 characters of perl
Text Repositories:
Oxford Text Archives
Project Gutenberg
Bartleby.com
Internet Archive
Google Books
JSTOR (access restricted)
General Linguistics:
The Linguist List -- Clearinghouse for linguistics information of all kinds
Peter Patrick's African American English Page -- Contains an especially comprehensive bibliography and a nice page of selected readings.
On-Line English Grammar -- a helpful resource for English grammatical terminology
Lexicon of Linguistics -- explains some linguistics terminology
Poetics and Linguistics Association -- Professional association for stylistics
Go to Clai's Home Page