teaching: Laura Alonso i Alemany
period: March to June 2006
site: Computer Science Department at FaMAF
Mondays and Wednesdays, 18h, room 11
NOTE: the course will be taught in Spanish
keywords: natural language processing, data mining, language technologies, (supervised and unsupervised) machine learning
 schedule
 what this is all about
 what we will talk about
 what you should read
 practical issues (schedule, evaluation, assignments, etc.)
 to know more...
what this is all about
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.

This course aims to be an introduction to the area of data mining as applied to text, seen from the perspective of natural language processing (NLP). I will describe the area, mostly in relation to well-established fields such as information retrieval, data-driven NLP and general data mining. Then, I will present various successful approaches to the discovery of information in text. Through case studies we will obtain a general picture of:
 which information needs must be covered,
 which textual properties can be exploited,
 how theoretical insights about textual properties can be implemented in effective tools or procedures.
For a more concrete idea of the topics to be discussed, check the starting list of papers, which will be completed to meet the interests of students.
The course will work as a seminar: students are expected to have read the paper(s) assigned for each day's lecture before class. I will briefly present each paper and start the discussion by asking some questions about it (10% of the final mark). Papers will serve as a starting point to introduce basic concepts and techniques in the area: alignment, clustering (similarity metrics, clustering algorithms), bootstrapping. Exercises will be proposed to practise these techniques (10% of the final mark). Further reading will be provided for each topic to be discussed.
Graduate students will have to replicate at least one of the experiments presented in the papers, introduce a modification in the method and produce a paper of their own. If more than two graduate students attend the course regularly, they will review each other's papers.
what we will not do (in class)
 a formal inspection of algorithms or techniques.
 learning how to use concrete tools, although students are expected to pick that up by themselves as part of the practical exercises.
 a course on classical natural language processing.
what I expect you to achieve
 a general overview of the area of text mining, with a bias to empirical NLP
 some familiarity (and operating capability) with supervised, semi-supervised and unsupervised machine learning techniques
 maturity to critique work in the area (leaving a written record of it)
 capacity to replicate and enhance already initiated work (graduate students)
what we will talk about
 what is and what is not data mining?
 natural language processing: classical architectures, data-driven solutions
 evaluation: how do you evaluate what you still don't know?
 data-driven characterization of linguistic phenomena
 generalization by typing: word classes (CLUSTERING) [reading]
 discovering your vocabulary: word association (HYPOTHESIS TESTING) [reading]
 you'll know a word by the company it keeps (CLASSIFICATION vs. CLUSTERING)
 reading between the lines (LATENT SEMANTIC ANALYSIS) [reading]
 other forms of partitioning (GENETIC ALGORITHMS) [reading]
 one thing after the other... (language as a sequence)
 ... or everything jumbled (text as a graph)
 enhancing resources (BOOTSTRAPPING) [reading]
 what next?
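To make the word-association topic above concrete, here is a minimal sketch of pointwise mutual information in the spirit of Church and Hanks (1990), computed over an invented toy corpus (corpus and counts are assumptions for illustration only):

```python
import math
from collections import Counter

def pmi(pair_count, x_count, y_count, n):
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) )."""
    return math.log2((pair_count / n) / ((x_count / n) * (y_count / n)))

tokens = "the cat , the dog , strong tea , strong tea , the cat".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
n = len(tokens) - 1  # number of bigram positions

# "strong tea" comes out as more strongly associated than the equally
# frequent but loosely bound "the cat".
score = lambda x, y: pmi(bigrams[(x, y)], unigrams[x], unigrams[y], n)
print(round(score("strong", "tea"), 2), round(score("the", "cat"), 2))
```

On real corpora, hypothesis tests (t-test, chi-square, log-likelihood) are used alongside PMI, since PMI is unreliable for low counts; the assigned reading discusses this.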
what you should read

 Ken Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16(1), pp. 22-29.
 David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. ACL'95.
 Chris Manning. 1993. Automatic Acquisition of a Large Subcategorisation Dictionary from Corpora. ACL'93.
EXTENSION: T. Briscoe and J. Carroll. 1997. Automatic Extraction of Subcategorization from Corpora. Proceedings of the 5th Conference on Applied Natural Language Processing.
 M.L. Forcada. 2001. Corpus-based Stochastic Finite-state Predictive Text Entry for Reduced Keyboards: Application to Catalan. Procesamiento del Lenguaje Natural 27 (XVII Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural, Jaén, Spain, 12-14.09.2001), pp. 65-70.
 W. A. Gale and Ken Church. 1991. Identifying Word Correspondences in Parallel Texts. ACL'91.
EXTENSION: A. Venugopal, S. Vogel and A. Waibel. 2003. Effective Phrase Translation Extraction from Alignment Models. ACL 2003.
F.J. Och and H. Ney. 2000. Improved Statistical Alignment Models. ACL 2000. In this paper, the authors present and compare various single-word-based alignment models for statistical machine translation: the IBM alignment models, the Hidden Markov alignment model, smoothing techniques and various modifications.
A Statistical MT Tutorial Workbook, prepared in connection with the JHU summer workshop. The basic text that this tutorial relies on is Brown et al. 1993, "The Mathematics of Statistical Machine Translation", Computational Linguistics.
 Regina Barzilay and Kathy McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. ACL'01
practical issues
timing
The course will consist of 120 hours: 60 lecture hours, 10 hours of mentoring and 50 hours covered by the practical exercises to be completed by students. Lecture hours will most probably be distributed in two-hour sessions twice a week, spanning four months. Concrete days and times for lectures will be set to meet attendees' needs.
assignments
There will be a minimum of two large assignments, one to be handed in after Easter and another at the end of the course. Students will choose their assignment from a list of possibilities. The goal of these assignments is for students to learn to plan and carry out text mining projects from end to end, that is, from design to evaluation of results. Graduate students are expected to produce a conference-style paper.
In addition, exercises will be proposed every two weeks, to be handed in within two weeks. The goal of these exercises is for students to learn how to use data mining tools and to deal with text. In well-motivated cases, a student may present a paper in class instead of carrying out a practical exercise.
One of the corpora that you can use (in case you don't have any better option):
evaluation
The subject will be evaluated as follows: 60% will correspond to practical exercises, 30% to a written exam, 10% to exercises proposed in class to practice the basics of data mining techniques (to be done as homework) and 10% to answers given in class during the discussion of papers.
to know more...
Untangling Text Data Mining (introductory)
journals, conferences
 the Association for Computational Linguistics and, more interestingly, its anthology, a digital archive of research papers in computational linguistics covering ACL, EACL, NAACL, CL, HLT and ACL-related workshops.
people
tools and resources
Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs.
MALLET is a Java toolkit for machine learning applied to natural language. It provides facilities for document classification, information extraction, part-of-speech tagging, noun phrase segmentation, general finite state transducers and classification.
The R Project for Statistical Computing: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
the NIST/SEMATECH e-Handbook of Statistical Methods, particularly the chapter on exploratory data analysis
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
There is also a wide range of proprietary software for data mining: Clementine (an enterprise data mining workbench that enables you to develop predictive models quickly using business expertise and deploy them into business operations to improve decision making) and Text Mining for Clementine, or SAS ("superior power that gives you the power to know").
applications
a course on Information Retrieval and Web Mining (2005) by Chris Manning and Prabhakar Raghavan, with slides for all the lectures. It deals with document (and information) indexing, efficient treatment of indexed collections, query expansion and efficient search, and offers crystal-clear introductions to basic concepts and techniques: vector space, clustering, classification.
natural language processing
 J. Allen (1987) Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc. [online version]
 C. Manning and H. Schütze (1999) Foundations of Statistical Natural Language Processing MIT Press.
 D. Jurafsky and J. H. Martin (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition PrenticeHall.
machine learning
a course on Knowledge Discovery in Databases by Howard J. Hamilton
inference
 T.K. Landauer and S.T. Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Psychological Review.