Data and Text Processing for Health and Life Sciences

of the publication entitled "Polymorphisms and deduced amino acid substitutions in the coding sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia"

Data and Text Processing for Health and Life Sciences by F. Couto

Data Retrieval: Caffeine Example

Finding phenotypic information: the first title that may attract our attention is "Polymorphisms and deduced amino acid substitutions in the coding sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia". Clicking on the Abstract link shows its abstract.

[Figure: Diseases recognized by the online tool MER in an abstract]

To check whether the abstract mentions any disease, use an online text mining tool, the Minimal Named-Entity Recognizer (MER), http://labs.rd.ciencias.ulisboa.pt/mer/:
- copy and paste the abstract
- select DO (Human Disease Ontology) as the lexicon
MER detects three mentions of malignant hyperthermia, each with a link about the disease.

[Figure: Ontobee entry for the class malignant hyperthermia]

We would need to repeat all these steps for all the proteins and for all the publications of each protein. It gets more complicated if we consider all central nervous system stimulants. This is the motivation to automate the process, since doing it manually is not humanly feasible.

Goal: find the relation between caffeine and hyperthermia. We could simply search for these two terms in PubMed, but:
1 Some relations are not explicitly mentioned in the text
2 The example uses different resources and multiple entries, to be automated using shell scripting

Data Retrieval: Unix shell


Tools also integrate semantic links
- GOPubMed: categorized texts using the Gene Ontology
- PubTator: texts annotated with biological entities

Open Access Publications
- full texts freely available with unrestricted use
- PubMed Central (PMC): more than 5 million documents

In 1993, definition of ontology: an explicit specification of a conceptualization
In 1997 and 1998, refined to: a formal, explicit specification of a shared conceptualization

Conceptualization: an abstract view of the concepts and the relationships of a given domain

Shared conceptualization: a group of individuals agree on it (common agreement)

Specification: a representation of that conceptualization using a given language. It needs to be formal and explicit so computers can deal with it.

"Identifiers","Name","Line Types"
"A2AGL3","Ryanodine receptor 3","CC - MISCELLANEOUS"
"A4GE69","7-methylxanthosine synthase 1","CC - FUNCTION"
...

pwd shows the full path of the directory (folder) in which the shell is working.
The dollar sign on the left marks a command to be executed directly in the shell.
A curved arrow on the right means that the command does not fit in the available width of the page and has to be presented in multiple lines.
To understand a command line tool, type man followed by the name of the tool.

For example: man pwd
Or type pwd --help for a more concise description of pwd.
ls shows the list of files in the current directory.
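The two tools above can be tried safely in a scratch directory. This is only a minimal sketch: the temporary directory and the two file names are hypothetical, chosen for illustration.

```shell
# Work in a fresh temporary directory so the example leaves nothing behind
# in our own folders (the directory and file names are hypothetical).
workdir=$(mktemp -d)
cd "$workdir"

# pwd prints the full path of the directory the shell is working in.
pwd

# Create two empty files, then list the contents of the current directory.
touch proteins.csv abstracts.txt
listing=$(ls)
echo "$listing"
```

When its output is captured or redirected, ls prints one name per line, in alphabetical order.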
Type ls --help for a concise description of ls.
Select as current directory one that we can easily open in our file explorer application.

Useful key combinations
If the terminal is blocked, pressing Ctrl-c cancels the current tool. For example, try using the cd command with only one single quote. Now press Ctrl-c, and the command will be aborted.
Ctrl-d indicates to the terminal that it is the end of input. The command will not be canceled: it is executed without the second single quote, and a syntax error will be shown on our display.

Remove the extra carriage return
The -d option of tr removes a given character from the input, in this case deleting all carriage returns (\r).
Command line options can be used in short form, using a single dash (-), or in long form, using two dashes (--): --delete is equivalent to -d.
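The carriage-return removal can be sketched as follows. The sample text is an assumption made for illustration (two identifiers with Windows-style line endings); only tr and its -d / --delete option come from the slides, and the long form assumes GNU tr.

```shell
# Short form: -d deletes every carriage return (\r) read from standard input.
short=$(printf 'A2AGL3\r\nA4GE69\r\n' | tr -d '\r')

# Long form: --delete is equivalent to -d (assuming GNU tr).
long=$(printf 'A2AGL3\r\nA4GE69\r\n' | tr --delete '\r')

# Both variables now hold the two lines without carriage returns.
printf '%s\n' "$short"
printf '%s\n' "$long"
```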

Redirection operator
The > character redirects the results that would be displayed at the standard output (our terminal) to a given file.
The < character works in the opposite direction: it opens a given file and uses it as the standard input.

The URL is not that of a fully RESTful web service, but it is pretty modular and self-explanatory. The path is clearly composed of:
- the name of the database (chebi);
- the method (viewDbAutoXrefs.do);
- the list of parameters and their values (arguments) after the ? character.
The order of the parameters is normally not relevant. Parameters are separated by the & character, and the = character assigns a value to each parameter (argument).
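The URL structure described above can be taken apart with plain shell parameter expansion. This is a sketch under stated assumptions: the host (example.org) and the parameter names and values are hypothetical placeholders, not the real service address; only chebi and viewDbAutoXrefs.do come from the slides.

```shell
# Hypothetical URL: host and parameters are placeholders for illustration.
url='https://example.org/chebi/viewDbAutoXrefs.do?chebiId=27732&dbName=UniProt'

path=${url%%\?*}           # everything before the ? character
query=${url#*\?}           # everything after the ? character
method=${path##*/}         # last path segment: the method
database=${path%/*}        # drop the method ...
database=${database##*/}   # ... and keep the segment just before it

echo "database: $database"
echo "method:   $method"

# Parameters are separated by &, and = assigns a value to each of them.
printf '%s\n' "$query" | tr '&' '\n' | while IFS='=' read -r name value; do
    echo "parameter: $name = $value"
done
```

Because the order of the parameters is normally not relevant, swapping the two name=value pairs in the query would be expected to request the same data.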

Need to extract only the identifiers
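One possible way to keep only the identifiers, combining the redirection operators with tr, is sketched below. The sample rows repeat the excerpt shown earlier; the comma-separated layout, the Windows-style line endings, the file names, and the use of tail and cut are assumptions made for illustration, not necessarily the method used in the slides.

```shell
workdir=$(mktemp -d)

# Hypothetical input file: a header plus two rows, with \r\n line endings.
printf '%s\r\n' \
    '"Identifiers","Name","Line Types"' \
    '"A2AGL3","Ryanodine receptor 3","CC - MISCELLANEOUS"' \
    '"A4GE69","7-methylxanthosine synthase 1","CC - FUNCTION"' \
    > "$workdir/xrefs.csv"

# < feeds the file to tr as standard input; > stores the cleaned result.
tr -d '\r' < "$workdir/xrefs.csv" > "$workdir/clean.csv"

# Skip the header line, keep the first comma-separated field,
# and strip the surrounding double quotes.
tail -n +2 "$workdir/clean.csv" | cut -d ',' -f 1 | tr -d '"' > "$workdir/ids.txt"

cat "$workdir/ids.txt"
```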
Word matching improves precision but decreases recall, since we may miss less common acronyms.

All options can be reproduced by {n,m}, where n and m specify the minimum and maximum number of occurrences. They may also be omitted, in which case no limit is imposed.
Using {1,1} is the same as not having any operator: both forms are equivalent.

Example of a sentence mentioning molecule, an ancestor of caffeine: The remainder of the molecule is hydrophilic and presumably constitutes the cytoplasmic domain of the protein.
Example of a sentence mentioning disease, an ancestor of malignant hyperthermia: Our data suggest that divergent activity profiles may cause varied disease phenotypes by specific mutations.
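Word matching and the {n,m} notation can be tried on the two example sentences. This is a minimal sketch assuming grep with the -w (whole word), -c (count matching lines) and -E (extended regular expressions) options; the file name and the search patterns are chosen for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sentences.txt" <<'EOF'
The remainder of the molecule is hydrophilic and presumably constitutes the cytoplasmic domain of the protein.
Our data suggest that divergent activity profiles may cause varied disease phenotypes by specific mutations.
EOF

# Without -w, 'phenotype' also matches inside 'phenotypes'.
loose=$(grep -c 'phenotype' "$workdir/sentences.txt")
# With -w only the whole word counts, so the plural form is missed:
# higher precision, lower recall. grep exits with status 1 when nothing
# matches, hence the '|| true'.
strict=$(grep -c -w 'phenotype' "$workdir/sentences.txt" || true)

# In extended regular expressions, ? is equivalent to {0,1} ...
optional=$(grep -c -E 'mutations?' "$workdir/sentences.txt")
optional_nm=$(grep -c -E 'mutations{0,1}' "$workdir/sentences.txt")
# ... and {1,1} is the same as not having any operator.
plain=$(grep -c -E 'molecule' "$workdir/sentences.txt")
plain_nm=$(grep -c -E 'molecule{1,1}' "$workdir/sentences.txt")

echo "phenotype: $loose line(s) without -w, $strict with -w"
echo "mutations?: $optional, mutations{0,1}: $optional_nm"
echo "molecule: $plain, molecule{1,1}: $plain_nm"
```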