Academics: prepare your computers for text-mining. Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely to follow suit this year, lowering barriers to the computer-based research technique. But some scientists object that even as publishers roll out improved technical infrastructure and allow greater access, they are exerting tight legal controls over the way text-mining is done.
A few years ago, scientists complained that publishers were stymieing ambitious plans to use computer software to pull out information from published papers. Some researchers who ran software to harvest data from online articles found their programs blocked, and those who asked for permission found themselves trapped in tortuous case-by-case negotiations — even though they had already paid subscription fees for access. Max Haeussler, a computational biologist at the University of California, Santa Cruz, for instance, spent more than three years arguing with publishers for permission to extract DNA data from 3 million articles to annotate an online map of the human genome (see Nature 483, 134–135; 2012).
“It was a legitimate criticism, that people sent text-mining requests in to publishers and they bounced around for a time without any response,” admits Chris Shillum, vice-president of product management for platform and content at Elsevier. The publisher previously considered requests “case by case”, he says — but it now wants to make text-mining permissions quicker and easier to obtain. “What we’ve tried to do is take the practical barriers away.”
Under the arrangements, announced on 26 January at the American Library Association conference in Philadelphia, Pennsylvania, researchers at academic institutions can use Elsevier’s online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a licence that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content.
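The conditions described above can be made concrete with a small sketch. The Python below is illustrative only: it does not use Elsevier's API, and the names `MinedSnippet`, `make_snippet` and `weekly_batches` are hypothetical. It assumes nothing beyond the figures reported here, namely snippets of up to 200 characters, a mandatory link to the original content, and a provisional cap of 10,000 articles per week.

```python
from dataclasses import dataclass

# Figures taken from the reported agreement; everything else is an assumption.
MAX_SNIPPET_CHARS = 200        # longest excerpt of original text allowed
WEEKLY_ARTICLE_LIMIT = 10_000  # provisional download cap per week


@dataclass
class MinedSnippet:
    """A short excerpt of mined text plus a link back to its source."""
    text: str
    source_url: str


def make_snippet(passage: str, source_url: str) -> MinedSnippet:
    """Trim a passage to the snippet limit and attach the required link."""
    if not source_url:
        raise ValueError("a link to the original content is required")
    return MinedSnippet(text=passage[:MAX_SNIPPET_CHARS], source_url=source_url)


def weekly_batches(article_ids: list[str]) -> list[list[str]]:
    """Split article identifiers into batches that respect the weekly cap."""
    return [
        article_ids[i:i + WEEKLY_ARTICLE_LIMIT]
        for i in range(0, len(article_ids), WEEKLY_ARTICLE_LIMIT)
    ]
```

If the provisional cap stayed fixed, a corpus of 3 million articles, the size Haeussler sought to mine, would split into 300 weekly batches.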
“Finally, someone is showing that there is no need to be afraid of text-mining analysis any more,” says Haeussler.
Researchers working on the Human Brain Project — a European consortium that plans to use a supercomputer to recreate everything known about the human brain — have already used Elsevier’s interface to do text-mining, says the project’s spokesman Richard Walker, who is based at the Swiss Federal Institute of Technology in Lausanne. “We are very pleased with it. It resolves genuine technical issues,” he says.
And neuroscientist Shreejoy Tripathy at the University of British Columbia in Vancouver, Canada, worked with Elsevier last year to pull out information on neuron physiology from thousands of articles (see neuroelectro.org). Text-mining is not yet well known, he says, but he hopes that the easier access will kick off its greater adoption among scientists. “As more papers get published that use text-mining, other researchers like myself — who are neuroscientists and not programmers — will see the need for the technique,” he says.
Shillum says that Elsevier is ahead of the curve — but that other publishers are likely to follow soon. CrossRef, a non-profit collaboration of thousands of scholarly publishers, will in the next few months launch a service that lets researchers agree to standard text-mining terms and conditions by clicking a button on a publisher’s website, a ‘one-click’ solution similar to Elsevier’s set-up.
And, in the past year, large institutions and pharmaceutical companies have started to ask for text- and data-mining rights when renegotiating site licences, says Jessica Rutt, rights and licensing manager at Nature Publishing Group (NPG), the publisher of this journal. Anyone with those rights may mine NPG content. Many publishers are also experimenting with delivering text-minable content to pharmaceutical companies for an extra fee, she adds.
But some researchers feel that a dangerous precedent is being set. They argue that publishers wrongly characterize text-mining as an activity that requires extra rights to be granted by licence from a copyright holder, and they feel that computational reading should require no more permission than human reading. “The right to read is the right to mine,” says Ross Mounce of the University of Bath, UK, who is using content-mining to construct maps of species’ evolutionary relationships.
National governments are also weighing in on the issue. The UK government aims this April to make text-mining for non-commercial purposes exempt from copyright, allowing academics to mine any content they have paid for. And the European Commission, worried that barriers to computational research could hinder scientific innovation, is also examining the issue. It has convened a group chaired by Ian Hargreaves, an intellectual-property specialist at Cardiff University, UK, who recommended the changes to UK law, to examine the economic impact of text- and data-mining for scientific research and barriers to its use. The panel will reach conclusions by the end of February.
“Our plan is just to wait for the copyright exemption to come into law in the United Kingdom so we can do our own content-mining our own way, on our own platform, with our own tools,” says Mounce. “Our project plans to mine Elsevier’s content, but we neither want nor need the restricted service they are announcing here.”