import React, {Component} from 'react';
import Typography from '@material-ui/core/Typography';
import Aux from './Aux.js';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';

export function DataCollectionText() {

    return (
        <Aux>
            <Typography variant="h5" paragraph={true}>
                    Dataset of Functional Analysis
            </Typography>
            <Typography align="justify" paragraph={true}>
            Gene lists associated with the proteostasis machinery were compiled manually. Lists for eight eukaryotic model organisms 
            (Homo sapiens, Gallus gallus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans, Saccharomyces cerevisiae, 
            Mus musculus and Arabidopsis thaliana) and two generic lists for bacteria and archaea composed the initial dataset for 
            the functional analysis. To correct annotation omissions, the gene lists were enriched by exploiting homologies 
            with the Homo sapiens gene set. Furthermore, the pool was extended with two other organisms, 
            Glycine max and Aspergillus nidulans, based on their homologies with Arabidopsis thaliana and Saccharomyces cerevisiae, respectively. 
            Homology mappings were retrieved from the Ensembl database (Hubbard et al., 2002), which encompasses separate repositories 
            for the main eukaryotic subclasses (Fig. 1). For the prokaryotes, we used the UniProt/SwissProt database 
            (Apweiler et al., 2004), which contains high-quality, manually curated annotations of thousands of prokaryotic genomes. 
            These organisms were ranked by their extent of functional annotation in order to identify the best-studied species. 
            Finally, seven bacteria and three archaea were selected, completing the final list of organisms (Fig. 2). 
            Proteostasis-related genes were interpreted through enrichment analysis with the Gene Ontology (GO) (Ashburner et al., 2000). 
            Specifically, the GO Biological Process (GO BP) and Cellular Component (GO CC) domains were used. These ontologies 
            form large directed acyclic graphs in which each node is a vocabulary term connected to other terms 
            through parent-child relations. We used BioInfoMiner (Lhomond et al., 2018) to perform the enrichment analysis, as it adopts 
            data-driven methodologies and exploits the graph structure of the ontologies to mitigate experimental and annotation noise. 
            Before running the analysis, we converted the GO annotation of each organism to a specific format and loaded it into MongoDB 
            to make it readable by BioInfoMiner.
            </Typography>
            <Typography variant="h5" paragraph={true}>
                    Dataset of Ribosomal and Protein Sequences
            </Typography>
            <Typography align="justify">
            Ribosomal (18S and 16S rRNA) sequences were retrieved from the European Nucleotide Archive (ENA) (Leinonen et al., 2003) and NCBI 
            repositories (Fig. 3), while HSP40 and HSP70 amino acid sequences were collected from the UniProt database. Ribosomal sequences are 
            consistent within each organism, and different studies arrive at approximately the same nucleotide sequence. In contrast, 
            heat shock proteins of the same molecular weight (for instance HSP70) can vary significantly even within the same organism. 
            Specifically, each protein constitutes a family of molecules that share some identical functional domains but may differ in 
            additional components or in their spatial conformation. As a result, the repositories for heat shock protein families 
            included many different sequences, which made pre-processing and the construction of a unified sequence pattern mandatory. 
            For that purpose, we constructed a computational pipeline that yields HSP40 and HSP70 consensus sequences for each organism (Fig. 4). 
            Starting from the whole set of a protein family, we kept only those sequences whose length had an absolute z-score lower than or equal to one. 
            The remaining sequences were clustered with the CD-HIT algorithm at a similarity threshold of 90% (Li et al., 2001). CD-HIT keeps the longest sequence as 
            the representative of each cluster, preserving as much information as possible for each one. If the output included more than one cluster, 
            we performed an additional step: we constructed the multiple sequence alignment (MSA) of the cluster representatives with ClustalW (Lombard et al., 2002) and built the 
            respective hidden Markov model (HMM) with the HMMER3 hmmbuild program (Eddy, 2011). The final consensus sequence was then generated with 
            the HMMER3 hmmemit program.
            </Typography>
        </Aux>
    )

}
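
// Illustrative sketch only: the sequence pipeline described in DataCollectionText ran
// outside this React app, and this helper is not part of it. It shows the first
// pre-processing step (keep sequences whose length has an absolute z-score of at most one).
// Assumptions: the function name is hypothetical, sequences are plain strings, and the
// population standard deviation is used (the text does not say which estimator was applied).
function filterByLengthZScore(sequences, maxAbsZ = 1) {
    const lengths = sequences.map((seq) => seq.length);
    const mean = lengths.reduce((sum, len) => sum + len, 0) / lengths.length;
    const variance =
        lengths.reduce((sum, len) => sum + (len - mean) ** 2, 0) / lengths.length;
    const sd = Math.sqrt(variance);

    // All sequences have the same length: nothing can be an outlier.
    if (sd === 0) {
        return sequences.slice();
    }

    // Keep only sequences whose length lies within maxAbsZ standard deviations of the mean.
    return sequences.filter((seq) => Math.abs((seq.length - mean) / sd) <= maxAbsZ);
}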




export function MethodsText1() {

    return (
        <Aux>
            <Typography variant="h5" paragraph={true}>
                    Introduction
            </Typography>
            <Typography align="justify" paragraph={true}>
            Enrichment analysis was performed on the organism-specific gene lists in order to reveal the systemic imprint of proteostasis. 
            Each gene list was analyzed with BioInfoMiner, using the Gene Ontology BP and CC ontologies. Due to the bias of 
            scientific research, the majority of the investigated eukaryotes, as well as Escherichia coli, are well-studied organisms 
            whose genomic annotation spans the whole ontological graph. General terms are linked to more specific ones, 
            which are explicit and highly informative. In contrast, the prokaryotic gene 
            annotations do not reach the maximum depth of the graph, leading to annotation inconsistencies. Therefore, different 
            BioInfoMiner p-value thresholds were adopted in order to extract a sufficiently large set of enriched terms for each organism. 
            Setting the minimum accepted number of terms to 30, we examined different combinations of hypergeometric and adjusted p-value thresholds, 
            aiming to reach or surpass that number. 
            </Typography>
            <Typography align="justify" paragraph={true}>
            Tracing the evolution of proteostasis among the investigated organisms requires a quantitative comparison of their enriched GO term 
            lists, as they constitute a thorough interpretation of the related genes. However, such a comparison needs to take into account 
            the hierarchical structure of the respective ontological graphs. Namely, it is not meaningful to compare different sets of ontological 
            terms with traditional set-theoretic measures, such as the Jaccard coefficient or Dice similarity, because terms are not 
            isolated entities but are semantically connected through ancestor-descendant relations. Furthermore, the presence of semantically 
            similar terms in the same list inevitably introduces bias. Taking all of the above into account, we designed a novel approach to calculate 
            the group-wise semantic similarity of two lists of GO terms, exploiting their topological proximity and avoiding the 
            negative effects of annotation bias.
            </Typography>


            <Typography variant="h5" paragraph={true}>
                    Construction of Phylogenetic Trees
            </Typography>
            <Typography align="justify" paragraph={true}>
            First, the term lists extracted by the BioInfoMiner analysis were filtered in order to eliminate potential semantic overlap among terms. 
            That step yielded two lists of unique terms for each organism (BP and CC terms). A term is considered unique if none 
            of its descendants appears in the same list. As a result, only the most specific entities are included in that list, and any more generic description 
            of the same biological process or cellular component is filtered out. In the following formulas, the initial lists are denoted as <InlineMath math='GO_{total}'/> and 
            the unique lists as <InlineMath math='GO_{unique}'/>. A traditional metric for comparing two ontological terms is the Resnik measure (Resnik, 1995), 
            which defines the semantic similarity between two terms as the information content of their most informative common ancestor (MICA) on the directed graph. 
            Based on that element-wise measure, the aggregated semantic similarity (<InlineMath math='AggSemSim_{AB}'/>) of a group of terms A with another 
            group B can be defined as (Wang et al., 2007):

            <BlockMath math="AggSemSim_{AB}=\sum_{i \in GO_{A}} \max\bigg[ SemSim(go_{i},GO_{B}) \bigg] \qquad (1)"/>

            where <InlineMath math="\max[ SemSim(go_{i},GO_{B}) ]" /> is the maximum semantic similarity of the term i with any term of the set B. 
            To calculate the group-wise similarity, <InlineMath math='AggSemSim_{BA}'/> must also be measured, and the sum of the two aggregated 
            amounts is normalized by the sum of the list lengths: 

            <BlockMath math="NormSemSim = \frac{AggSemSim_{AB} + AggSemSim_{BA}}{|GO_{A}| + |GO_{B}|} \qquad (2)"/>

            Accordingly, the distance between sets A and B can be defined as:

            <BlockMath math="NormDist = 1 - NormSemSim \qquad (3)"/>
            </Typography>
            <Typography align="justify" paragraph={true}>
            As mentioned above, we reduced the size of the term lists by keeping only the most specific terms, a process that does not cause any 
            loss of descriptive information. However, organisms are annotated at different levels of detail due to the biased interests 
            of scientific studies. For that reason, the filtering might introduce inconsistencies between two organisms. In particular, 
            terms common to two organisms could be excluded from the list of the well-annotated one, because some of their descendants are also enriched, 
            while they would be retained as unique terms in the list of the less-studied species, because it lacks in-depth annotation. 
            To avoid such misinterpretations, we modified equations 1 and 2 as follows:

            <BlockMath math="AggSemSim_{AB}=\sum_{i \in GO_{A_{unique}}} \max\bigg[ SemSim(go_{i},GO_{B_{total}}) \bigg] \qquad (4)"/>

            <BlockMath math="NormSemSim = \frac{AggSemSim_{AB} + AggSemSim_{BA}}{|GO_{A_{unique}}| + |GO_{B_{unique}}|} \qquad (5)"/>

            The use of <InlineMath math='GO_{unique}'/> prevents bias from redundancy in the examined list of terms. It is compared 
            against <InlineMath math='GO_{total}'/>, a larger and more detailed ontological set, where redundancies do not affect the 
            final result, as equation 4 uses only the maximum similarities. With that approach, we constructed a semantic distance matrix for 
            the selected organisms and inferred the phylogenetic tree through hierarchical clustering of the organisms. For that task, 
            we used the agglomerative clustering implementation with the Ward variance minimization algorithm from the Python SciPy package.
            </Typography>


            <Typography variant="h5" paragraph={true}>
                Components Related to Proteostasis 
            </Typography>
            <Typography align="justify" paragraph={true}>
            All statistically significant unique terms were gathered and mapped onto the GO BP and GO CC graphs to capture unified snapshots of 
            cell mechanisms and cellular compartments. That process revealed many linked ontological terms, stressing the need for graph pruning 
            to homogenize functional profiles and standardize the comparative analysis. To this end, we clustered the terms based on the Resnik measure, 
            reducing the enriched GO BP and GO CC graphs to 40 and 25 clusters, respectively. At each clustering round, the most similar pair was 
            replaced by its most informative common ancestor. Finally, the participation of each organism in these clusters was identified, 
            in order to detect similar motifs among species and project them onto the respective correlation heatmap.
            </Typography>

        </Aux>

    )

}
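
// Illustrative sketch only: the analysis described in MethodsText1 was run in Python,
// outside this React app. These helpers mirror the group-wise comparison of two GO term
// lists, in which unique terms of one organism are compared against all terms of the other.
// Assumptions (all names are hypothetical): `ancestorsOf` is a Map from a term to an
// iterable of ALL its proper ancestors, `semSim(a, b)` is a pairwise similarity already
// scaled to [0, 1], and every term list is non-empty.

// Keep only the most specific terms: drop any term that is an ancestor of another listed term.
function uniqueTerms(terms, ancestorsOf) {
    const covered = new Set();
    for (const term of terms) {
        for (const ancestor of ancestorsOf.get(term) || []) {
            covered.add(ancestor);
        }
    }
    return terms.filter((term) => !covered.has(term));
}

// Sum, over the unique terms of A, of the best similarity found in the total list of B.
function aggSemSim(uniqueA, totalB, semSim) {
    return uniqueA.reduce(
        (sum, a) => sum + Math.max(...totalB.map((b) => semSim(a, b))),
        0
    );
}

// Symmetric, length-normalized similarity turned into a distance.
function normDist(uniqueA, totalA, uniqueB, totalB, semSim) {
    const aggregated =
        aggSemSim(uniqueA, totalB, semSim) + aggSemSim(uniqueB, totalA, semSim);
    return 1 - aggregated / (uniqueA.length + uniqueB.length);
}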





export function MethodsText2() {

    return (
        <Aux>
            <Typography variant="h5" paragraph={true}>
                    Sequence Comparison
            </Typography>
            <Typography align="justify" paragraph={true}>
            The distance matrices for the ribosomal sequences and the HSP40 and HSP70 protein families were constructed with ClustalW. Subsequently, 
            the phylogenetic trees were created using the same hierarchical clustering approach as in the functional analysis (Fig. 9-11).
            </Typography>
        </Aux>

    )

}




export function MethodsText3() {

    return (
        <Aux>
            <Typography variant="h5" paragraph={true}>
                    Dendrogram Comparison
            </Typography>
            <Typography align="justify" paragraph={true}>
            While the extracted distance matrices and phylogenetic trees revealed differences and commonalities among the 
            four evolutionary criteria, we performed an additional computational task to further elucidate the distance patterns 
            that emerged among species and superkingdoms. Using the multi-dimensional scaling (MDS) algorithm, each distance matrix was 
            transformed into a two-dimensional representation. Thus, each two-dimensional vector corresponds to the location of an organism 
            in Cartesian space. Afterwards, K-means clustering with k=1 grouped all the vectors into one cluster and calculated the 
            coordinates of the respective centroid. Intuitively, the centroid can represent the origin of each evolutionary 
            tree, and all the other points on the scatter plot indicate the distance of each organism from that point (Fig. 12).
            </Typography>
        </Aux>

    )

}
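
// Illustrative sketch only, not the code used for the analysis described in MethodsText3.
// With k=1, K-means places its single centroid at the arithmetic mean of all points,
// so the centroid step can be reproduced directly. Assumptions: the function name is
// hypothetical and `points` is a non-empty array of [x, y] pairs, such as the
// two-dimensional MDS coordinates of the organisms.
function centroidDistances(points) {
    const n = points.length;

    // Arithmetic mean of the coordinates, i.e. the k=1 K-means centroid.
    const centroid = points.reduce(
        ([sumX, sumY], [x, y]) => [sumX + x / n, sumY + y / n],
        [0, 0]
    );

    // Euclidean distance of each organism from the centroid.
    const distances = points.map(([x, y]) =>
        Math.hypot(x - centroid[0], y - centroid[1])
    );

    return { centroid, distances };
}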