Research area "Text mining"

Text mining plays an important role in bioinformatics because a significant part of biological knowledge is available only as free text. Extracting this knowledge requires recognizing biological objects in texts, especially genes and proteins.

Our research focuses on:

Information extraction
Generation of networks from texts for advanced analyses
Analysis of information derived from text mining together with data from other sources (e.g. gene expression data)

Research team

Researchers

Students

Martin Szugat

Protein Services

ProThesaurus-Wiki

A thesaurus for gene and protein names.

Features:

manual querying and editing of the entries of curated synonym dictionaries
searching all synonyms for a given gene/protein via PubMed and Google
dictionaries for human, mouse, rat, fly and yeast

ProThesaurus

A biological name and markup web service for automated querying via custom software.

ProTag - Web Servicing the Biological Office

Integration of the ProThesaurus web services into Microsoft Office applications for retrieval and markup of gene and protein names.

Syngrep

Named Entity Recognition of objects in biomedical texts

The scientific literature is the most comprehensive source of information on molecular biology objects and their interactions. Unfortunately, free texts refer to biological objects (genes, proteins, diseases, cells, organisms) by rather unsystematic and partially ambiguous terms. Automatic text mining procedures are required to identify the objects (Named Entity Recognition, NER) and to extract their interactions. Various spelling variants and synonyms need to be recognized in order to assign unique identifiers to ambiguous free text names.
Frequently, texts are matched against so-called synonym lists. Such lists for genes and proteins of higher organisms (e.g. humans) comprise 30k-50k objects per organism, each of which can be referenced via up to 10-15 different synonyms. The data volume is large: for example, multiple synonym lists totalling 170k objects with 1.6M synonyms may be searched simultaneously against 19M PubMed abstracts (66 GB XML / 26 GB text). Moreover, the search results are highly sensitive to the quality and curation of the synonym lists used. We developed the tool syngrep, which is more memory-efficient and more than 100 times faster than well-known Unix tools such as fgrep. High-quality matches of synonyms in the biological literature have important applications, for instance the assignment of annotations to the identified objects and the augmentation of knowledge bases with literature-derived facts.
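The idea of matching a text against a synonym list can be illustrated with a small sketch. It is not the syngrep implementation, whose internals are not described here; it assumes a simple token trie, a toy normalization (lowercasing and dropping hyphens), and hypothetical identifiers such as GENE:1. All function names are made up for illustration.

```python
import re

END = "$ids"  # marker key; cannot collide with tokens, which are alphanumeric


def tokens_of(text):
    """Return (normalized_token, start, end) for each word-like run in text."""
    result = []
    for m in re.finditer(r"[A-Za-z0-9-]+", text):
        # Toy normalization (an assumption): 'IL-2' and 'il2' become equal.
        norm = m.group().lower().replace("-", "")
        if norm:  # skip runs that consist of hyphens only
            result.append((norm, m.start(), m.end()))
    return result


def build_trie(synonyms):
    """Build a token trie from a dict: identifier -> list of synonym strings."""
    trie = {}
    for identifier, names in synonyms.items():
        for name in names:
            node = trie
            for norm, _, _ in tokens_of(name):
                node = node.setdefault(norm, {})
            # An ambiguous synonym simply collects several identifiers.
            node.setdefault(END, set()).add(identifier)
    return trie


def find_matches(text, trie):
    """Return (start, end, surface, ids), taking the longest match per position."""
    toks = tokens_of(text)
    matches = []
    i = 0
    while i < len(toks):
        node, j, best = trie, i, None
        while j < len(toks) and toks[j][0] in node:
            node = node[toks[j][0]]
            j += 1
            if END in node:
                best = (j, node[END])
        if best is None:
            i += 1
            continue
        end_tok, ids = best
        start, end = toks[i][1], toks[end_tok - 1][2]
        matches.append((start, end, text[start:end], sorted(ids)))
        i = end_tok
    return matches


# Hypothetical two-entry synonym list.
synonym_list = {
    "GENE:1": ["interleukin 2", "IL-2"],
    "GENE:2": ["tumor necrosis factor", "TNF"],
}
trie = build_trie(synonym_list)
for match in find_matches("IL-2 and tumor necrosis factor were induced.", trie):
    print(match)
```

Because all synonyms share one trie, the text is scanned once regardless of how many synonyms the list contains. A production tool would need a far more careful treatment of spelling variants and of ambiguous names than this toy normalization provides.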

RelEx

Extraction of relationships between objects from the biomedical literature

Two approaches for relation extraction are currently employed. Machine learning methods assign relations to particular sentences after training on suitable manually annotated corpora. Linguistic methods, on the other hand, analyze the grammatical structure of sentences and recognize relations in partial sentences by appropriate rules. Usually, a very large number of rules has to be defined to extract relations with sufficient precision and recall.
The RelEx approach uses the Stanford Lexicalized Parser to construct semantic dependency trees from the sentences. Because the rules operate on these dependency trees rather than on the surface text, only a few tree rules are required to harvest a wide range of complex relationship phrases.
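Why a rule on a dependency tree can cover several surface phrasings may be easier to see in a small sketch. The following is not the RelEx rule set and does not call the Stanford parser; it assumes a hand-written tree for the toy sentence "MDM2 binds p53 and p73", with edge labels in the style of Stanford typed dependencies (nsubj, dobj, conj, cc). The function name and data layout are hypothetical.

```python
def extract_relations(edges, entities, relation_verbs):
    """Apply one toy rule: entity --nsubj-- verb --dobj-- entity.

    edges: list of (head, label, dependent) triples describing the tree.
    entities: set of words already recognized as genes/proteins.
    relation_verbs: set of verbs considered to express a relation.
    """
    children = {}
    for head, label, dep in edges:
        children.setdefault(head, []).append((label, dep))

    def expand(node):
        """Return the node plus everything coordinated with it ('B and C')."""
        found = [node]
        for label, dep in children.get(node, []):
            if label == "conj":
                found.extend(expand(dep))
        return found

    def arguments(verb, wanted_label):
        """Entities attached to the verb via the given dependency label."""
        return [node
                for label, dep in children.get(verb, [])
                if label == wanted_label
                for node in expand(dep)
                if node in entities]

    relations = []
    for verb in children:
        if verb in relation_verbs:
            for subject in arguments(verb, "nsubj"):
                for obj in arguments(verb, "dobj"):
                    relations.append((subject, verb, obj))
    return relations


# Hand-built dependency edges for "MDM2 binds p53 and p73" (an assumption,
# not actual parser output).
edges = [
    ("binds", "nsubj", "MDM2"),
    ("binds", "dobj", "p53"),
    ("p53", "cc", "and"),
    ("p53", "conj", "p73"),
]
print(extract_relations(edges, {"MDM2", "p53", "p73"}, {"binds"}))
```

The single subject-verb-object rule also picks up the coordinated object, because coordination is followed in the tree instead of being spelled out as a separate surface pattern.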

2007

Papers

Katrin Fundel, R. Küffner, Ralf Zimmer. RelEx - Relation extraction using dependency parse trees. Bioinformatics, vol 23, no. 3, pp. 365-371, 2007.

You can find the manually curated test set of relations from the HPRD here

Publications

2007

Katrin Fundel, R. Küffner, Ralf Zimmer. RelEx - Relation extraction using dependency parse trees. Bioinformatics, vol 23, no. 3, pp. 365-371, 2007.

Katrin Fundel, Ralf Zimmer. Human Gene Normalization by an Integrated Approach including Abbreviation Resolution and Disambiguation. Second BioCreative Challenge Evaluation Workshop, 2007.

2006

Papers

Katrin Fundel, Ralf Zimmer. Gene and protein nomenclature in public databases. BMC Bioinformatics, vol 7, pp. 372, 2006.

2005

Papers

R. Küffner, Katrin Fundel, Ralf Zimmer. Expert knowledge without the expert: integrated analysis of gene expression and literature to derive active functional contexts. Bioinformatics, vol 21, no. Suppl. 2, pp. 259-267, 2005.

Martin Szugat, Daniel Güttler, Florian Sohler, Ralf Zimmer. Web Servicing the Biological Office. Bioinformatics, vol 21, no. Suppl. 2, pp. 268-269, 2005.

Katrin Fundel, Daniel Güttler, Ralf Zimmer, Joannis Apostolakis. A simple approach for protein name identification: prospects and limits. BMC Bioinformatics, vol 6, no. Suppl. 1, pp. S15, May 2005.

Katrin Fundel, Daniel Hanisch, Heinz-Theodor Mevissen, Ralf Zimmer, Juliane Fluck. ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, vol 6, no. Suppl. 1, pp. S14, May 2005.

2004

Papers

Katrin Fundel, Daniel Güttler, Ralf Zimmer, Joannis Apostolakis. Exact versus approximate string matching for protein name identification. BioCreative Challenge Evaluation Workshop, 2004.

Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, Juliane Fluck. ProMiner: Organism-specific protein name detection using approximate string matching. BioCreative Challenge Evaluation Workshop, 2004.

2003

Papers

Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, Ralf Zimmer. Playing biology's name game: identifying protein names in scientific text. Russ B. Altman, Keith A. Dunker, Lawrence Hunter, Teri E. Klein (eds.): Proceedings of the 8th Pacific Symposium on Biocomputing (PSB 2003), Lihue, Hawaii, USA, January 3-7, 2003, pp. 403-414, 2003.
