Research area "Text Mining"
Text mining plays an important role in bioinformatics because a significant part of biological knowledge is available only as free text. Extracting this knowledge requires recognizing biological objects in texts, especially genes and proteins.
Our research focuses on:
Information extraction
Generation of networks from texts for advanced analyses
Analysing information derived from text mining together with data from other sources (e.g. gene expression data)
Research team
Protein Services
A thesaurus for gene and protein names.
Allows for:
manual querying and editing of the entries of curated synonym dictionaries
searching all synonyms for a given gene/protein via PubMed and Google
access to dictionaries for human, mouse, rat, fly and yeast
A biological name and markup web service for automated querying via custom software.
Integration of the ProThesaurus web services into Microsoft Office applications for retrieval and markup of gene and protein names.
Syngrep
Named Entity Recognition of objects in biomedical texts
The scientific literature is the most comprehensive source of information on molecular biology objects and their interactions. In free text, however, biological objects (genes, proteins, diseases, cells, organisms) are referenced by rather unsystematic and partially ambiguous terms. Automatic text mining procedures are required to identify the objects (Named Entity Recognition, NER) and to extract their interactions. Various spelling variants and synonyms need to be recognized in order to assign unique identifiers to ambiguous free text names.
Frequently, texts are matched against so-called synonym lists. Such lists for genes and proteins of higher organisms (e.g. humans) comprise 30k-50k objects per organism, each of which can be referenced by up to 10-15 different synonyms. The data volume is large: for example, multiple synonym lists totalling 170k objects with 1.6M synonyms may be searched simultaneously against 19M PubMed abstracts (66 GB XML / 26 GB text). At the same time, the search results are highly sensitive to the quality and curation of the synonym lists used. We developed the tool syngrep, which is more memory efficient and more than 100 times faster than well-known Unix tools such as fgrep. High-quality matches of synonyms in the biological literature have important applications, for instance assigning annotations to the identified objects and augmenting knowledge bases with literature-derived facts.
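The idea of matching texts against synonym lists can be illustrated with a minimal Python sketch. This is not the syngrep implementation, whose algorithm is not described here; it is a simple token-based longest-match lookup. The function names (build_index, find_matches), the identifiers GENE_A and GENE_B, and the lowercase normalization are assumptions made for illustration only.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Split text into (token, start, end) triples of alphanumeric runs."""
    return [(m.group(0), m.start(), m.end()) for m in TOKEN.finditer(text)]

def build_index(synonyms):
    """Map each normalized token sequence to the identifiers that use it.

    synonyms: dict of identifier -> list of synonym strings (hypothetical format).
    Returns the index and the length (in tokens) of the longest synonym.
    """
    index = defaultdict(set)
    max_len = 0
    for ident, names in synonyms.items():
        for name in names:
            key = tuple(tok.lower() for tok, _, _ in tokenize(name))
            if key:
                index[key].add(ident)
                max_len = max(max_len, len(key))
    return dict(index), max_len

def find_matches(text, index, max_len):
    """Scan left to right, preferring the longest synonym at each position."""
    tokens = tokenize(text)
    norm = [tok.lower() for tok, _, _ in tokens]
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(norm[i:i + n])
            if key in index:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                # An ambiguous synonym yields several identifiers.
                matches.append((text[start:end], start, end, sorted(index[key])))
                i += n
                break
        else:
            i += 1  # no synonym starts at this token
    return matches

# Illustrative synonym list; the identifiers are placeholders.
synonyms = {
    "GENE_A": ["tumor protein p53", "TP53", "p53"],
    "GENE_B": ["BRCA1", "breast cancer 1"],
}
index, max_len = build_index(synonyms)
for hit in find_matches("Mutations in TP53 and breast cancer 1 were reported.", index, max_len):
    print(hit)
```

A hash lookup per token window is easy to follow but would not scale well to 1.6M synonyms and millions of abstracts; it is meant only to show how spelling variants map to identifiers and how ambiguous names surface as multiple candidates.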
RelEx
Extraction of relationships between objects from the biomedical literature
Two approaches for relation extraction are currently employed. Machine learning methods assign relations to particular sentences after training on suitable manually annotated corpora. Linguistic methods, on the other hand, analyze the grammatical structure of sentences and recognize relations in sentence fragments using appropriate rules. Usually, a very large number of rules has to be defined to extract relations with sufficient precision and recall.
The RelEx approach uses the Stanford Lexicalized Parser to construct semantic dependency trees from the sentences. Because the rules operate on these dependency trees, only a few tree rules are required to capture a wide range of complex relationship phrases.
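How a single rule on a dependency tree can cover several surface phrasings may be sketched as follows. This is an illustration only and not one of the actual RelEx rules: the rule (a verb whose subject and object subtrees both contain recognized entities), the function names, the placeholder entity names, and the hand-written parse with Stanford-style labels (nsubj, dobj, cc, conj) are all assumptions. No parser is called.

```python
from collections import defaultdict

def children_map(edges):
    """Group (head, relation, dependent) edges by head index."""
    kids = defaultdict(list)
    for head, rel, dep in edges:
        kids[head].append((rel, dep))
    return dict(kids)

def subtree(node, kids):
    """Return all token indices in the subtree rooted at node, in order."""
    seen, stack = [], [node]
    while stack:
        n = stack.pop()
        seen.append(n)
        stack.extend(dep for _, dep in kids.get(n, []))
    return sorted(seen)

def extract_relations(tokens, edges, entities):
    """Hypothetical rule: entity-in-subject, verb, entity-in-object.

    entities: set of token indices previously marked by entity recognition.
    """
    kids = children_map(edges)
    relations = []
    for head, deps in sorted(kids.items()):
        subjects = [d for rel, d in deps if rel == "nsubj"]
        objects = [d for rel, d in deps if rel == "dobj"]
        for s in subjects:
            agents = [i for i in subtree(s, kids) if i in entities]
            for o in objects:
                targets = [i for i in subtree(o, kids) if i in entities]
                for a in agents:
                    for t in targets:
                        relations.append((tokens[a], tokens[head], tokens[t]))
    return relations

# Hand-written parse of "GeneA activates GeneB and GeneC" (placeholder names).
tokens = ["GeneA", "activates", "GeneB", "and", "GeneC"]
edges = [(1, "nsubj", 0), (1, "dobj", 2), (2, "cc", 3), (2, "conj", 4)]
print(extract_relations(tokens, edges, entities={0, 2, 4}))
```

Because the rule looks at whole subtrees rather than adjacent words, the coordinated object is handled without a separate rule for enumerations, which is the kind of economy the dependency-tree view is meant to provide.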
You can find the manually curated test set of relations from the HPRD here
Publications
2005
Papers
Martin Szugat, Daniel Güttler, Florian Sohler, Ralf Zimmer.
Web Servicing the Biological Office.
Bioinformatics, vol. 21, Suppl. 2, pp. 268-269, 2005.
Katrin Fundel, Daniel Güttler, Ralf Zimmer, Joannis Apostolakis.
A simple approach for protein name identification: prospects and limits.
BMC Bioinformatics, vol. 6, Suppl. 1, p. S15, May 2005.
Katrin Fundel, Daniel Hanisch, Heinz-Theodor Mevissen, Ralf Zimmer, Juliane Fluck.
ProMiner: rule-based protein and gene entity recognition.
BMC Bioinformatics, vol. 6, Suppl. 1, p. S14, May 2005.
2003
Papers
Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, Ralf Zimmer.
Playing biology's name game: identifying protein names in scientific text.
Russ B. Altman, Keith A. Dunker, Lawrence Hunter, Teri E. Klein (eds.):
Proceedings of the 8th Pacific Symposium on Biocomputing (PSB 2003), Lihue, Hawaii, USA, January 3-7, 2003, pp. 403-414, 2003.