teaching: Laura Alonso i Alemany
period: August to November 2008
Computer Science Department at FaMAF, Wednesdays and Fridays, 11h, room 13

NOTE: the course will be taught in Spanish

keywords: natural language processing, data mining, language technologies, (supervised and unsupervised) machine learning

what this is all about
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.

from Marti Hearst's essay on What is Text Mining?

This course aims to be an introduction to the area of data mining as applied to text, seen from the perspective of natural language processing (NLP). I will describe the area, mostly in relation to well-established areas like information retrieval, data-driven NLP and general data mining. Then, I will present various successful approaches to the discovery of information in text. Through case studies we will obtain a general picture of:

  • which information needs need to be covered,
  • which textual properties can be exploited,
  • how theoretical insights about textual properties can be implemented in effective tools or procedures.
For a more concrete idea of the topics to be discussed, check the starting list of papers, which will be extended to meet the interests of students.
The course will work as a seminar, where students are expected to have read the paper(s) assigned for each day's lecture prior to the class. Then, I will briefly present the papers and start the discussion by asking some questions about them (10% of the final mark). Papers will serve as a starting point to introduce basic concepts and techniques in the area: alignment, clustering (similarity metrics, clustering algorithms), bootstrapping. Exercises will be proposed to practise these techniques (10% of the final mark). Further reading will be provided for each topic to be discussed.
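As a rough illustration of what a similarity metric for clustering looks like, here is a minimal sketch in Python. It is not taken from any of the course papers: the whitespace tokenization, the raw-count bag-of-words representation and the function names `bag_of_words` and `cosine_similarity` are all assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as token counts (assumption: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = bag_of_words("text mining finds new information in text")
    doc2 = bag_of_words("data mining finds patterns in data")
    print(cosine_similarity(doc1, doc2))
```

A clustering algorithm would then group documents whose pairwise similarity is high; other metrics and weightings (for example tf-idf) can be swapped in without changing that overall scheme.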
Graduate students will have to replicate at least one of the experiments presented in the papers, introduce a modification in the method and produce a paper of their own. If more than two graduate students attend the course regularly, they will review each other's papers.
what we will not do (in class)
  • a formal inspection of algorithms or techniques.
  • learn how to use concrete tools, although students are expected to learn that by themselves as part of the practical exercises.
  • a course on classical natural language processing.
what I expect you to achieve
  • a general overview of the area of text mining, with a bias to empirical NLP
  • some familiarity (and operating capability) with (semi | un)supervised machine learning techniques
  • maturity for criticizing work in the area (leaving written testimony of it)
  • capacity to replicate and enhance already initiated work (graduate students)


practical issues
The course will consist of 120 hours, of which 60 will be lecture hours, 10 will be mentoring and 50 will be covered by the practical exercises to be completed by students. Lecture hours will mostly be distributed in two-hour sessions twice a week, spanning four months.
There will be one big or two smaller assignments, to be handed in at the end of the course. There will be a list of possibilities for students to choose their assignment. The goal of these assignments is that students learn to plan and carry out text mining projects from end to end, that is, from design to evaluation of the results. Graduate students are expected to produce a conference-style paper.
In well-motivated cases, a student can present a paper in class instead of carrying out a practical exercise.
The subject will be evaluated as follows: 50% will correspond to practical exercises, 50% to an exam. Practical exercises can be either one big project or two smaller ones. Discussion of papers in class and small homework exercises also help and can be counted within practical exercises.


to know more...
Untangling Text Data Mining

journals, conferences

tools and resources
Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs.
MALLET is a Java toolkit for machine learning applied to natural language. It provides facilities for document classification, information extraction, part-of-speech tagging, noun phrase segmentation, general finite state transducers and classification
The R Project for Statistical Computing: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
NIST/SEMATECH e-Handbook of Statistical Methods, particularly the chapter on exploratory data analysis
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
There's also a wide range of proprietary software for data mining:
Clementine (an enterprise data mining workbench that enables you to develop predictive models quickly using business expertise and deploy them into business operations to improve decision making) and Text Mining for Clementine, SAS (superior power that gives you the power to know)

a course on Information Retrieval and Web Mining (2005) by Chris Manning and Prabhakar Raghavan, with slides of all the lectures. Deals with: document (and info) indexing, efficient treatment of indexed collections, query expansion, efficient search. Crystal-clear intros to basic concepts and techniques: vector space, clustering, classification.

natural language processing

machine learning
a course on Knowledge Discovery in Databases by Howard J. Hamilton