teaching: Laura Alonso i Alemany

period: March to June 2006

site: Computer Science Department at the FaMAF
Mondays and Wednesdays, 18h., room 11

NOTE: the course will be taught in Spanish

keywords: natural language processing, data mining, language technologies, (supervised and unsupervised) machine learning

what this is all about

Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
from Marti Hearst's essay on What is Text Mining?

This course aims to be an introduction to data mining as applied to text, seen from the perspective of natural language processing (NLP). I will describe the area, mostly in relation to well-established areas like information retrieval, data-driven NLP and general data mining. Then, I will present various successful approaches to the discovery of information in text. Through case studies we will obtain a general picture of:

  • which information needs need to be covered,
  • which textual properties can be exploited,
  • how theoretical insights about textual properties can be implemented in effective tools or procedures.

For a more concrete idea of the topics to be discussed, check the starting list of papers, which will be completed to meet the interests of students.

The course will work as a seminar, where students are expected to have read the paper(s) assigned for each day's lecture prior to the class. I will briefly present each paper and open the discussion by asking some questions about it (answers count for 10% of the final mark). Papers will serve as a starting point to introduce basic concepts and techniques in the area: alignment, clustering (similarity metrics, clustering algorithms), bootstrapping. Exercises will be proposed to practise these techniques (10% of the final mark). Further reading will be provided for each topic to be discussed.
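To give a flavour of what a similarity metric for clustering looks like, here is a minimal sketch of cosine similarity between two texts. It assumes whitespace tokenization and raw word counts, and the function name `cosine_similarity` is illustrative only; real experiments would normally use better tokenization and term weighting.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    # Assumption: lowercasing plus whitespace splitting is a good enough tokenizer.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction; treat it as dissimilar to everything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A clustering algorithm would then group together the texts (or words, represented by their contexts) whose pairwise similarity is high.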

Graduate students will have to replicate at least one of the experiments presented in the papers, introduce a modification in the method and produce a paper of their own. If more than two graduate students attend the course regularly, they will review each other's papers.

what we will not do (in class)

  • a formal inspection of algorithms or techniques.
  • learn how to use concrete tools, although students are expected to learn that by themselves as part of the practical exercises.
  • a course on classical natural language processing.

what I expect you to achieve

  • a general overview of the area of text mining, with a bias to empirical NLP
  • some familiarity (and operating capability) with (semi | un)supervised machine learning techniques
  • maturity for criticizing work in the area (leaving written testimony of it)
  • capacity to replicate and enhance already initiated work (graduate students)


what we will talk about

  1. what is and what is not data mining?
  2. natural language processing: classical architectures, data-driven solutions
  3. evaluation: how do you evaluate what you still don't know?
  4. data-driven characterization of linguistic phenomena
    1. generalization by typing: word classes (CLUSTERING) [reading]
    2. discovering your vocabulary: word association (HYPOTHESIS TESTING) [reading]
    3. you'll know a word by the company it keeps (CLASSIFICATION vs. CLUSTERING)
      1. word sense disambiguation [reading]
      2. subcategorization acquisition [reading]
      3. statistical machine translation [reading]
      4. paraphrasing [reading]
    4. reading between the lines (LATENT SEMANTIC ANALYSIS) [reading]
    5. other forms of partitioning (GENETIC ALGORITHMS) [reading]
  5. one thing after the other... (language as a sequence)
    1. language models, n-grams (MARKOV MODELS) [reading]
    2. statistical machine translation (ALIGNMENT) [reading]
    3. union makes force (MULTIPLE SEQUENCE ALIGNMENT) [reading]
    4. divide and you'll win (SEGMENTATION) [reading]
  6. ... or everything jumbled (text as a graph)
    1. information retrieval: relevance assessment for keyword and passage extraction, document retrieval (GRAPH THEORY) [reading]
    2. chains [reading]
    3. partially known semantic-syntactic structures (coordination, parallelism) [reading]
  7. enhancing resources (BOOTSTRAPPING) [reading]
  8. what next?
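As a small illustration of the "language as a sequence" block, the following sketch estimates a bigram (first-order Markov) language model by relative frequency. It assumes whitespace tokenization and no smoothing, so unseen word pairs get probability zero; the names `train_bigram_model` and `sentence_probability` and the boundary markers `<s>` and `</s>` are choices made for this example only.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next | previous) by relative frequency (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # Boundary markers let the model learn how sentences start and end.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


def sentence_probability(model, sentence):
    """Product of bigram probabilities; 0.0 if any bigram was never seen."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob
```

For example, after training on the two sentences "the cat sat" and "the dog sat", the model assigns probability 0.5 to "the cat sat" and 0.0 to "the bird sat", which shows why smoothing matters in practice.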


what you should read


practical issues


The course will consist of 120 hours, of which 60 will be lecture hours, 10 will be mentoring and 50 will be covered by the practical exercises to be completed by students. Lecture hours will most probably be distributed in two-hour sessions twice a week over four months. Concrete days and times for lectures will be set to meet attendees' needs.


There will be a minimum of 2 big assignments, one to be handed in after Easter and another at the end of the course. Students will choose their assignments from a list of possibilities. The goal of these assignments is that students learn to plan and carry out text mining projects from end to end, that is, from design to evaluation of the results. Graduate students are expected to produce a conference-style paper.

Besides, exercises will be proposed every two weeks, to be handed in within two weeks. The goal of these exercises is that students learn how to use data mining tools and to deal with text. In well-motivated cases, a student can present a paper in class instead of carrying out a practical exercise.

One of the corpora that you can use (in case you don't have any better option):


The subject will be evaluated as follows: 60% will correspond to practical exercises, 30% to a written exam, 10% to exercises proposed in class to practise the basics of data mining techniques (to be done as homework) and 10% will correspond to answers given in class during the discussion of papers.


to know more...

Untangling Text Data Mining


journals, conferences


tools and resources

Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs.

MALLET is a Java toolkit for machine learning applied to natural language. It provides facilities for document classification, information extraction, part-of-speech tagging, noun phrase segmentation, general finite state transducers and classification.

The R Project for Statistical Computing: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

the NIST/SEMATECH e-Handbook of Statistical Methods, particularly the chapter on exploratory data analysis

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

There's also a wide range of proprietary software for data mining: Clementine (an enterprise data mining workbench that enables you to develop predictive models quickly using business expertise and deploy them into business operations to improve decision making), Text Mining for Clementine, and SAS (superior power that gives you the power to know).


a course on Information Retrieval and Web Mining (2005) by Chris Manning and Prabhakar Raghavan, with slides of all the lectures. It deals with document (and info) indexing, efficient treatment of indexed collections, query expansion and efficient search. Crystal-clear intros to basic concepts and techniques: vector space, clustering, classification.

natural language processing

machine learning

a course on Knowledge Discovery in Databases by Howard J. Hamilton