teaching: Laura Alonso i Alemany
period: August to November 2008
site: Computer Science Department at the FaMAF
Wednesdays and Fridays, 11h., room 13
NOTE: the course will be taught in Spanish
keywords: natural language processing, data mining, language technologies, (supervised and unsupervised) machine learning
 schedule
 what this is all about
 practical issues (schedule, evaluation, assignments, etc.)
 to know more...
what this is all about
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
from Marti Hearst's essay What is Text Mining?
This course aims to be an introduction to the area of data mining as applied to text, seen from the perspective of natural language processing (NLP). I will describe the area, mostly in relation to well-established areas like information retrieval, data-driven NLP and general data mining. Then, I will present various successful approaches to the discovery of information in text. Through case studies we will obtain a general picture of:
 which information needs need to be covered,
 which textual properties can be exploited,
 how theoretical insights about textual properties can be implemented in effective tools or procedures.
The course will work as a seminar: students are expected to have read the paper(s) assigned for each day's lecture before class. Papers will then be briefly presented by me, and I will start the discussion by asking some questions about the paper (10% of the final mark). Papers will serve as a starting point to introduce basic concepts and techniques in the area: alignment, clustering (similarity metrics, clustering algorithms), bootstrapping. Exercises will be proposed to practise these techniques (10% of the final mark). Further reading will be provided for each topic to be discussed.
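As a small taste of one of these techniques, here is a minimal sketch of a similarity metric of the kind used for clustering texts, assuming a plain bag-of-words representation (the function name and the example texts are made up for illustration; standard library only):

```python
from collections import Counter
from math import sqrt

def cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Texts sharing many words score close to 1; texts with no words in common score 0.
print(cosine("text mining finds new facts", "mining text for new facts"))
print(cosine("text mining", "clustering algorithms"))  # 0.0
```

A clustering algorithm would then group together documents whose pairwise similarity is high, which is one of the building blocks discussed in the papers.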
Graduate students will have to replicate at least one of the experiments presented in the papers, introduce a modification in the method and produce a paper of their own. If more than two graduate students attend the course regularly, they will review each other's papers.
what we will not do (in class)
 a formal inspection of algorithms or techniques.
 learn how to use concrete tools, although students are expected to learn that by themselves as part of the practical exercises.
 a course on classical natural language processing.
what you will get
 a general overview of the area of text mining, with a bias towards empirical NLP
 some familiarity (and operating capability) with (semi-/un)supervised machine learning techniques
 maturity for criticizing work in the area (and leaving written testimony of it)
 the capacity to replicate and enhance already initiated work (graduate students)
practical issues
timing
The course will consist of 120 hours, of which 60 will be lecture hours, 10 will be mentoring and 50 will be covered by the practical exercises to be completed by students. Lecture hours will mostly be distributed in two-hour sessions twice a week, spanning four months.
assignments
There will be one big or two smaller assignments, to be handed in at the end of the course. There will be a list of possibilities for students to choose their assignment from. The goal of these assignments is that students learn to plan and carry out text mining projects end to end, that is, from design to evaluation of the results. Graduate students are expected to produce a conference-style paper. In well-motivated cases, a student can present a paper in class instead of carrying out a practical exercise.
evaluation
The subject will be evaluated as follows: 50% will correspond to practical exercises, 50% to an exam. Practical exercises can be either one big project or two smaller ones. Discussion of papers in class and small homework exercises are of help and can be included within the practical exercises.
to know more...
Untangling Text Data Mining
journals, conferences
 the Association for Computational Linguistics and, more interestingly, its anthology, a digital archive of research papers in computational linguistics covering ACL, EACL, NAACL, CL, HLT and ACL-related workshops.
people
tools and resources
Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs.
MALLET is a Java toolkit for machine learning applied to natural language. It provides facilities for document classification, information extraction, part-of-speech tagging, noun phrase segmentation, general finite state transducers and classification.
The R Project for Statistical Computing: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
the NIST/SEMATECH e-Handbook of Statistical Methods, particularly the chapter on exploratory data analysis
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
There is also a wide range of proprietary software for data mining: Clementine (an enterprise data mining workbench that enables you to develop predictive models quickly using business expertise and deploy them into business operations to improve decision making) together with Text Mining for Clementine, and SAS ("superior power that gives you the power to know").
applications
a course on Information Retrieval and Web Mining (2005) by Chris Manning and Prabhakar Raghavan, with slides of all the lectures. Deals with: document (and info) indexing, efficient treatment of indexed collections, query expansion, efficient search. Crystal-clear intros to basic concepts and techniques: vector space, clustering, classification.
natural language processing
 J. Allen (1987) Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc. [online version]
 C. Manning and H. Schütze (1999) Foundations of Statistical Natural Language Processing MIT Press.
 D. Jurafsky and J. H. Martin (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition PrenticeHall.
machine learning
a course on Knowledge Discovery in Databases by Howard J. Hamilton
inference
 T.K. Landauer and S.T. Dumais (1997) A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Psychological Review.
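The core idea of LSA, as presented in that paper, can be sketched in a few lines: build a term-document count matrix, keep only the top singular values of its SVD, and compare documents in the resulting low-dimensional latent space rather than by raw word overlap. A minimal sketch with NumPy, using a made-up toy matrix:

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are
# documents, values are raw occurrence counts.
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

# LSA = truncated SVD: keep only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents are now compared in the latent space, so two documents can be
# similar even with little direct word overlap.
sim = cos(doc_vecs[0], doc_vecs[2])
print(round(sim, 3))
```

This is only a toy illustration of the mathematical machinery; the paper applies it at scale to model human knowledge acquisition.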