some short papers illustrating up-to-date experiments and applications:

August 15

Introduction to the course: brainstorming, what is text mining? what is it useful for?
Evaluation: how do you evaluate what you still don't know? baselines, hypothesis tests

compulsory reading: W. Fan, L. Wallace, S. Rich, Z. Zhang, 2005. Tapping into the power of text mining, Communications of the ACM.
recommended reading: success cases:
homework: think about what you would like to do as your course project, and find corpora from which some structured knowledge could be learned.
You can also have a tour:
slides: Untangling Text Data Mining, by Marti Hearst

August 20

Natural Language Processing: classical architectures, data-driven solutions

compulsory reading: Introduction to Speech and Language Processing, by D. Jurafsky and J. H. Martin (2000)
recommended reading: J. Allen (1987) Natural Language Understanding
homework: download Weka, install it and get it to run, check that you can open and successfully read the manual
slides: we will be using the first part of my talk on free NLP software resources and we will work on the blackboard.

August 22

Math Foundations and Linguistic Essentials

compulsory reading: none!
recommended reading: chapters 1 and 2 from Foundations of Statistical NLP.
homework: prepare corpus data to work with Weka in clustering
slides: we will be working with an extract of Lluís Padró's slides on Statistical Methods for NLP. We will also peruse Lluís Màrquez's slides on Machine Learning for NLP. We might use some examples of the slides on Linguistic Essentials to illustrate linguistic phenomena and we may also resort to some other slides on Math Foundations.
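
For the homework on preparing corpus data for Weka, here is a minimal sketch of one possible way to write a small corpus out in Weka's ARFF format. It assumes a plain bag-of-words representation with one numeric attribute per word; the function name `corpus_to_arff`, the relation name and the toy documents are illustrative assumptions, not part of the course material.

```python
from collections import Counter
import re


def corpus_to_arff(docs, relation="toy_corpus"):
    """Return an ARFF string with one numeric word-count attribute per vocabulary word."""
    # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})

    lines = [f"@relation {relation}", ""]
    # Prefix attribute names with "w_" so they cannot clash with ARFF keywords.
    lines += [f"@attribute w_{w} numeric" for w in vocab]
    lines += ["", "@data"]
    # One comma-separated row of word counts per document.
    for toks in tokenized:
        counts = Counter(toks)
        lines.append(",".join(str(counts[w]) for w in vocab))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "the dog sat"]
    print(corpus_to_arff(docs))
```

Saving the returned string to a file with the `.arff` extension should give something Weka can open; for a real corpus you would probably want a better tokenizer and some vocabulary pruning.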

August 27


August 29

Data-driven characterization of linguistic phenomena: clustering

compulsory reading: Clustering, chapter 14 of Foundations of Statistical NLP
recommended reading: chapter on exploratory data analysis from the NIST/SEMATECH e-Handbook of Statistical Methods
homework 1: get acquainted with Weka's functionalities for clustering (at least read the manual!)
homework 2: toy experiments with clustering words to find equivalence classes.

You should be able to present succinctly (2-3 minutes) the results of your experiments in class, in one to two weeks' time.
slides: we'll be using Chris Manning's slides on document clustering (also in .pdf), since they follow the chapter quite closely.

You can also play with Dekang Lin's demo page on similarity of words and discuss the results in class.
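
As a starting point for homework 2, here is a minimal sketch of a toy word-clustering experiment in plain Python. It is not Weka's clustering and not the method of any of the readings: it simply assumes that words are represented by counts of their immediate neighbours, compared with cosine similarity, and merged whenever the similarity reaches a threshold (single-link grouping). The function names, the window size and the threshold of 0.9 are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def context_vectors(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        toks = sent.lower().split()
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][toks[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster_words(sentences, threshold=0.9):
    """Group words whose context vectors have cosine similarity >= threshold."""
    vectors = context_vectors(sentences)
    parent = {w: w for w in vectors}

    def find(w):
        # Follow parent links up to the representative of w's group.
        while parent[w] != w:
            w = parent[w]
        return w

    # Single-link grouping: any sufficiently similar pair merges two groups.
    for a, b in combinations(sorted(vectors), 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for w in vectors:
        groups[find(w)].add(w)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    corpus = [
        "the cat eats fish",
        "the dog eats meat",
        "a cat drinks milk",
        "a dog drinks water",
    ]
    for group in cluster_words(corpus):
        print(group)
```

On this tiny corpus the words that occur in identical contexts (for example "cat" and "dog") end up in the same class; on real data the threshold and the context window would need tuning.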

September 3 and 5

fully supervised >> mildly supervised >> fully unsupervised

(comparison on the task of word sense disambiguation)

compulsory reading:

recommended reading: Word Sense Disambiguation, chapter 7 of Foundations of Statistical NLP
slides: we'll be skimming through some of the slides from the tutorial Advances in Word Sense Disambiguation given by Rada Mihalcea and Ted Pedersen at IBERAMIA-2004, ACL-2005 and AAAI-2005

September 12

refining clustering for word sense discrimination

compulsory reading: Pantel, P. and Lin, D. 2002. Discovering Word Senses from Text. KDD-02
recommended reading:
slides: a bunch of slides of mine commenting on the reading (believe it or not, for once it was me who made the slides!)

September 17

small-world graphs to discover minor word senses for fine-grained Information Retrieval

compulsory reading: J. Véronis. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech and Language, 18 (3)

September 19

unsupervised, graph-based techniques as applied to Biomedical Text Mining

compulsory reading: Amgad Madkour, Kareem Darwish, Hany Hassan, Ahmed Hassan, Ossama Emam. 2007. BioNoculars: Extracting Protein-Protein Interactions from Biomedical Text. In Proceedings of the Workshop on Biological translational and clinical language processing, ACL'07.
recommended reading: Takaaki Hasegawa Satoshi Sekine and Ralph Grishman. 2004. Discovering Relations among Named Entities from Large Corpora. ACL 2004

September 24


September 26


October 1

Association rules as applied to text

compulsory reading: Bernardi, M., Lapi, M., Leo, P., Loglisci, C. 2005. Mining Generalized Association Rules on Biomedical Literature. In: Moonis, A. Esposito, F. (eds): Innovations in Applied Artificial Intelligence. Lect. Notes Artif. Int. 3353 (2005) 500-509
slides: slides for Chapter 6, Association Analysis: Basic Concepts and Algorithms (612KB), from the book Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

October 3

Language as a sequence: multiple sequence alignment and paraphrase extraction

compulsory reading: Regina Barzilay and Kathy McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. ACL
recommended reading: check Patrick Lambert's Bibliography for Statistical Alignment and Machine Translation (strongly recommended), and also:
homework: skim and get a good (discussable) idea of the workbook A Statistical MT Tutorial Workbook prepared in the JHU summer workshop.
just for fun: take a look at alphamalig, the multiple alignment tool with parametrisable distances and alphabets.
slides: the skeleton of the lecture was Nathalie Japkowicz's slides on chapter 13 for her course Natural Language Processing, A Statistical Approach. We saw some examples of difficult cases for alignment, and the IBM models, in Chris Manning's slides for his lecture on Statistical Machine Translation in his course on Natural Language Processing. The slides on dynamic programming techniques applied to sequence alignment were taken from Robert W. Robinson.

define the schedule for your project; you should already be reading the state of the art on the subject

October 6

Multiple Sequence Alignment to discover the structure of natural languages

compulsory reading: Z. Solan, D. Horn, E. Ruppin, and S. Edelman, Unsupervised learning of natural languages, PNAS
recommended reading: W. R. Pearson and D. J. Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448
slides: D. Horn on Adios.

October 8

Linear segmentation and word chains

compulsory reading:
recommended reading: Freddy Y. Y. Choi, Peter Wiemer-Hastings, Johanna Moore. 2001. Latent Semantic Analysis for Text Segmentation. Proceedings of 6th EMNLP
slides: some of Marti Hearst's slides on discourse processing and text segmentation from her course Applied Natural Language Processing (2006).

October 10

Learning of Morphology

compulsory reading: Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics. 27 (2)
recommended reading: Yu Hu, I. Matveeva, J. Goldsmith, C. Sprague. 2005. Using Morphology and Syntax Together in Unsupervised Learning. ACL workshop PsychoCompLA-2005

October 15

Ontology Induction and Population

compulsory reading:
M. Ruiz-Casado, E. Alfonseca and P. Castells. 2005. Automatic extraction of semantic relationships for WordNet by means of pattern learning from Wikipedia. Proceedings of NLDB-2005. In Natural Language Processing and Information Systems.
Pantel, P. 2005. Inducing Ontological Co-occurrence Vectors. ACL-05.
recommended reading: take a tour around the PASCAL ontology learning challenge (2006) and OLP3 – 3rd Workshop on Ontology Learning and Population held at ECAI 2008
slides: Eduard Hovy's slides for introducing ontologies and Patrick Pantel's slides for inducing ontological co-occurrence vectors.

October 17

Text as a graph: graph-based algorithms for NLP

compulsory reading: Avrim Blum and Shuchi Chawla. 2001. Learning from Labeled and Unlabeled Data using Graph Mincut. ICML'01.
recommended reading:
For a nice, short discussion of what transductive learning is, take a look at Thorsten Joachims's paper Transductive Learning via Spectral Graph Partitioning, Proceedings of the International Conference on Machine Learning (ICML), 2003. His page on Spectral Graph Partitioning also has interesting material.
papers presented at the workshop on graph-based algorithms for NLP'06 (HLT-06), TextGraphs-07 and TextGraphs-08.

October 22

Feature Selection

compulsory reading: Introduction to Feature Extraction, Foundations and Applications by Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi Zadeh, (eds). Series Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006.
Huan Liu; Lei Yu. 2005. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 4, April 2005
recommended reading: Workshop on New challenges for feature selection in data mining and knowledge discovery 2008, ECML PKDD 2008
also the Special Issue on Variable and Feature Selection of the Journal of Machine Learning Research, 2003.

October 24

Learning Selectional Preferences.

compulsory reading: Shane Bergsma, Dekang Lin and Randy Goebel. 2008. Discriminative Learning of Selectional Preference from Unlabeled Text. EMNLP 2008

October 29

Information Extraction

compulsory reading: Matthew Michelson and Craig A. Knoblock. 2007. Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web. International Journal of Document Analysis and Recognition (IJDAR), Special Issue on Noisy Text Analytics.
Marius Pasca and Benjamin Van Durme. 2007. What You Seek is What You Get: Extraction of Class Attributes from Query Logs, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07).
Marius Pasca. 2007. Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds, Proceedings of the 16th International World Wide Web Conference (WWW-07).

recommended reading: Roman Yangarber, Ralph Grishman, Pasi Tapanainen and Silja Huttunen. 2000. Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. In Proceedings of Conference on Applied Natural Language Processing ANLP-NAACL 2000 pp. 282-289, (2000) Seattle, WA.

October 31

Fact Extraction (mildly supervised)

compulsory reading: Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, Alpa Jain. 2006. Names and similarities on the web: Fact extraction in the fast lane. ACL. 2006

recommended reading:

November 5

Inference Rules

Rahul Bhagat, Patrick Pantel, Eduard Hovy. 2007. LEDIR: An Unsupervised Algorithm for Learning Directionality of Inference Rules. EMNLP'07.

November 7

Data-driven characterization of linguistic phenomena: latent semantic analysis, principal component analysis

compulsory reading: T. K. Landauer, P. W. Foltz, & D. Laham. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25.
recommended reading:
homework: you can try the Open Source LSA Package for R, you can also take a look at the LSA page of the University of Colorado.
slides: some very nice slides on principal component analysis from a course in Princeton... authorship unknown!

November 12

mining Wikipedia

compulsory reading: Fadi Biadsy; Julia Hirschberg; Elena Filatova. 2008. An Unsupervised Approach to Biography Production Using Wikipedia. ACL 2008.
Elif Yamangil; Rani Nelken. 2008. Mining Wikipedia Revision Histories for Improving Sentence Compression. ACL 2008.
recommended reading: list of academic papers that use Wikipedia

November 14

co-reference resolution

compulsory reading: Hoifung Poon and Pedro Domingos. 2008. Joint Unsupervised Coreference Resolution with Markov Logic. EMNLP 2008.
Vincent Ng. 2008. Unsupervised Models for Coreference Resolution. EMNLP 2008.

November 19

surprise topic: decipherment!!!

compulsory reading: Kevin Knight, Anish Nair, Nishit Rathod and Kenji Yamada. 2006. Unsupervised Analysis For Decipherment Problems. COLING-ACL 2006 (poster)

November 21

informal presentation of projects (as they are at the time; no need for slides!), brainstorming on each other's projects.

discussion of the course: what was good, what could/should be improved, what did you learn, what did you want to learn?

what is the future of text mining?

take a look at the Grand Challenge of Text Mining, as stated by Ronen Feldman in the KDD-2006 panel report What are the Grand Challenges for Data Mining?

Text Mining is an exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR and knowledge management. Text Mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules etc) and visualization of the results.

[...] we would like to have (this is our text mining grand challenge) Text mining systems that will be able to pass standard reading comprehension tests such as SAT, GRE, GMAT etc.

Systems that will be able to pass the average scores will win the grand challenge. The systems can utilize the web when answering the test questions. We view this grand challenge as an extension of the classic Turing test. This grand challenge satisfies most of the criteria that were set for the various challenges. First, there are no systems today that are able to get above average score in any of the standard tests. Second, the criterion for success is very well defined. Then, we believe that within 5 years researchers will be able to build such systems based on technologies that are developed for annual competitions such as ACE, TREC and TIDES. Finally, having such systems will contribute to the advance of humankind as the underlying technologies deployed by these systems can be utilized by children and adults to more rapidly acquire knowledge about various topics.