gsnn.gsnn.proc.bio

Functions

build_uniprot_symbol_map(func_edges)

All (uniprot, gene_symbol, func_name) triples observed on any edge endpoint.

ensg2symbol(ensg_ids[, allow, drop_na])

Map Ensembl gene IDs (ENSG) to HGNC gene symbols using PyPath.

get_bio_interactions([undirected, ...])

Retrieve and standardise directed biological interactions from the OmniPath knowledge base suite.

symbol2uniprot(gene_symbols[, allow, drop_na])

Map gene symbols to UniProt accession IDs using PyPath.

uniprot2symbol(uniprot_ids[, allow, drop_na])

Map UniProt accession IDs to HGNC gene symbols using PyPath.

gsnn.gsnn.proc.bio.build_uniprot_symbol_map(func_edges)

All (uniprot, gene_symbol, func_name) triples observed on any edge endpoint.

Parameters:

func_edges (pd.DataFrame) – The second return value of get_bio_interactions(). Must contain columns src, dst, source_uniprot, target_uniprot.

Returns:

Columns (in order): uniprot, gene_symbol, func_name, node_kind ('PROTEIN' or 'RNA'). One row per unique (uniprot, func_name) pair. Many-to-many: a single uniprot may appear with multiple func_names, and a single func_name may carry multiple uniprots.

Return type:

pd.DataFrame

Notes

  • Excludes synthetic COMPLEX:... accessions emitted by the OmniPath complexes endpoint (they are not real UniProt IDs).

  • Excludes miRBase-style identifiers carried on miRNA edges (anything that does not look like a UniProt accession; see _UNIPROT_RE).

  • Pure pandas; no network I/O.

  • The unique uniprots here are a strict superset of the well-formed UniProt accessions in func_nodes['uniprot'] (i.e. those matching _UNIPROT_RE), because func_nodes keeps only one accession per node via .last() while this table retains every distinct (uniprot, func_name) pair seen on any edge endpoint.
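The notes above can be illustrated with a minimal pandas sketch of how such a table might be assembled from an edge list. This is not the library's implementation: the helper name sketch_uniprot_map is hypothetical, UNIPROT_RE is the accession pattern published by UniProt and only stands in for the private _UNIPROT_RE (which may differ), and deriving node_kind and gene_symbol by splitting func_name on "__" is an assumption.

```python
import re
import pandas as pd

# Assumption: the accession pattern published by UniProt; the library's
# private _UNIPROT_RE may differ.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

def sketch_uniprot_map(func_edges: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation, for illustration only."""
    # Stack source and target endpoints into one long (uniprot, func_name) table.
    src = func_edges[["source_uniprot", "src"]].rename(
        columns={"source_uniprot": "uniprot", "src": "func_name"})
    dst = func_edges[["target_uniprot", "dst"]].rename(
        columns={"target_uniprot": "uniprot", "dst": "func_name"})
    pairs = pd.concat([src, dst], ignore_index=True).dropna()
    # Keep well-formed accessions only; this drops COMPLEX:... and miRBase-style IDs.
    keep = pairs["uniprot"].map(lambda x: bool(UNIPROT_RE.match(str(x))))
    pairs = pairs[keep].drop_duplicates().reset_index(drop=True)
    # Split e.g. 'PROTEIN__TP53' into node_kind and gene_symbol.
    parts = pairs["func_name"].str.split("__", n=1, expand=True)
    pairs["node_kind"] = parts[0]
    pairs["gene_symbol"] = parts[1]
    return pairs[["uniprot", "gene_symbol", "func_name", "node_kind"]]

# Toy edge list; the complex and miRNA accessions are made up for the demo.
toy_edges = pd.DataFrame({
    "src": ["PROTEIN__TP53", "COMPLEX__EED_EZH2", "RNA__HSA-MIR-21"],
    "dst": ["RNA__MDM2", "PROTEIN__TP53", "RNA__PTEN"],
    "source_uniprot": ["P04637", "COMPLEX:Q15910_O75530", "MIMAT0000076"],
    "target_uniprot": ["Q00987", "P04637", "P60484"],
})
uniprot_map = sketch_uniprot_map(toy_edges)
print(uniprot_map)
```

In this toy input the complex and miRNA endpoints are filtered out, and the TP53 accession, which appears on two edges, is kept once.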

gsnn.gsnn.proc.bio.ensg2symbol(ensg_ids, allow='1:m', drop_na=True)

Map Ensembl gene IDs (ENSG) to HGNC gene symbols using PyPath.

A convenience wrapper around pypath.utils.mapping.map_name that translates Ensembl gene IDs into their corresponding HGNC gene symbols.

Two mapping strategies are available (allow):

  1. '1:m' - keep all gene symbols associated with an Ensembl ID (one-to-many, default).

  2. '1:1' - keep only the first gene symbol returned by PyPath for each Ensembl ID (one-to-one).
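The row semantics of the two strategies can be sketched without PyPath. The example below is a hedged illustration only: toy_lookup is a hypothetical stand-in for the PyPath lookup, the IDs and symbols are placeholders, and sketch_map is not part of the package.

```python
import pandas as pd

# Hypothetical lookup table standing in for the PyPath mapping call.
toy_lookup = {
    "ENSG_A": ["SYM1", "SYM2"],   # one ID, two symbols
    "ENSG_B": ["SYM3"],
}

def sketch_map(ids, lookup, allow="1:m", drop_na=True):
    """Illustrative only: mimics the documented row semantics."""
    if allow not in ("1:m", "1:1"):
        raise ValueError("allow must be '1:m' or '1:1'")
    rows = []
    for query in pd.Series(ids).drop_duplicates():   # duplicates collapse here
        hits = lookup.get(query, [])
        if allow == "1:1":
            hits = hits[:1]                          # keep only the first hit
        if not hits:
            rows.append((query, None))               # unmapped ID
        rows.extend((query, hit) for hit in hits)
    out = pd.DataFrame(rows, columns=["ensg_id", "gene_symbol"])
    if drop_na:
        out = out.dropna(subset=["gene_symbol"]).reset_index(drop=True)
    return out

queries = ["ENSG_A", "ENSG_A", "ENSG_B", "ENSG_X"]
one_to_many = sketch_map(queries, toy_lookup, allow="1:m")
one_to_one = sketch_map(queries, toy_lookup, allow="1:1")
with_unmapped = sketch_map(queries, toy_lookup, allow="1:m", drop_na=False)
print(one_to_many, one_to_one, with_unmapped, sep="\n\n")
```

With '1:m' the first ID contributes two rows, with '1:1' it contributes one, and the unmapped ID only appears when drop_na=False.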

Parameters:
  • ensg_ids (Sequence[str] or pandas.Series) – Iterable of Ensembl gene IDs. Duplicate IDs are collapsed to the unique set for the lookup, but the returned DataFrame contains one row per combination of Ensembl ID and gene symbol.

  • allow (str, optional) – Mapping strategy; must be either '1:m' or '1:1'. Defaults to '1:m'.

  • drop_na (bool, optional) – If True, drop rows where the Ensembl ID could not be mapped to a gene symbol. Defaults to True.

Returns:

A two-column DataFrame with

  • 'ensg_id' - Ensembl gene ID (str)

  • 'gene_symbol' - HGNC gene symbol (str) or None if the Ensembl ID could not be mapped.

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from gsnn.proc.bio import ensg2symbol
>>> ensgs = pd.Series(['ENSG00000100030', 'ENSG00000171862', 'INVALID'])
>>> ensg2symbol(ensgs, allow='1:m').head()
           ensg_id gene_symbol
0  ENSG00000100030       MAPK1
1  ENSG00000171862        PTEN

gsnn.gsnn.proc.bio.get_bio_interactions(undirected=False, include_tf_mirna=False, include_pathway_extra=False, include_kinase_extra=False, include_ligrec_extra=False, include_collecTRI=False, include_dorothea=True, include_omnipath=True, dorothea_levels=['A', 'B'], gene_symbol=True, complex_handling='link', min_n_references=None, min_curation_effort=None, return_uniprot_map=False, verbose=True)

Retrieve and standardise directed biological interactions from the OmniPath knowledge base suite.

The function downloads, harmonises and concatenates several curated interaction resources that are exposed through the omnipath Python package and converts them into a single DataFrame with unified node identifiers. Each identifier is prefixed with the molecular entity type so that the downstream GSNN pipeline can easily distinguish between RNA and protein nodes:

  • PROTEIN__<gene_symbol>

  • RNA__<gene_symbol>

In addition, an explicit translation edge (RNA__<gene_symbol> -> PROTEIN__<gene_symbol>) is created for every gene that is found in both the RNA and the protein namespace.
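As a rough sketch of the translation-edge rule, the snippet below derives such edges from a toy edge list. The edge_type label "translation" and the toy edges are assumptions made for the illustration; the source does not state which label the function uses.

```python
import pandas as pd

# Toy interaction table using the prefixed node identifiers.
edges = pd.DataFrame({
    "src": ["PROTEIN__TP53", "PROTEIN__TP53", "PROTEIN__EGF"],
    "dst": ["RNA__MDM2", "RNA__TP53", "PROTEIN__EGFR"],
})

# Collect every node id, then split each into (kind, gene symbol).
node_ids = pd.unique(edges[["src", "dst"]].to_numpy().ravel())
kind_symbol = [node.split("__", 1) for node in node_ids]
rna = {symbol for kind, symbol in kind_symbol if kind == "RNA"}
protein = {symbol for kind, symbol in kind_symbol if kind == "PROTEIN"}

# One RNA -> PROTEIN edge per gene present in both namespaces.
shared = sorted(rna & protein)
translation = pd.DataFrame({
    "src": [f"RNA__{s}" for s in shared],
    "dst": [f"PROTEIN__{s}" for s in shared],
    "edge_type": "translation",   # assumed label, not confirmed by the source
})
print(translation)
```

Only TP53 occurs in both namespaces here, so a single RNA__TP53 -> PROTEIN__TP53 edge is produced.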

Parameters:
  • undirected (bool, optional (default=False)) – If True, the graph is made undirected by adding a reverse edge for every existing interaction.

  • include_tf_mirna (bool, optional (default=False)) – Whether to augment the graph with TF-miRNA and miRNA-target interactions.

  • include_pathway_extra (bool, optional (default=False)) – Whether to include additional pathway interactions that lack direct literature support.

  • include_kinase_extra (bool, optional (default=False)) – Whether to include additional kinase-substrate interactions that lack direct literature support.

  • include_ligrec_extra (bool, optional (default=False)) – Whether to include additional ligand-receptor interactions that lack direct literature support.

  • include_collecTRI (bool, optional (default=False)) – Whether to include CollecTRI transcription-factor regulon interactions.

  • include_dorothea (bool, optional (default=True)) – Whether to include DoRothEA transcription-factor regulon interactions.

  • include_omnipath (bool, optional (default=True)) – Whether to include curated OmniPath protein-protein interactions.

  • dorothea_levels (list[str], optional (default=['A', 'B'])) – Confidence levels to retain from the DoRothEA transcription-factor regulon resource. Valid levels are ['A', 'B', 'C', 'D'].

  • gene_symbol (bool, optional (default=True)) – If True, the identifiers are returned as HGNC gene symbols; otherwise UniProt accession IDs are used.

  • complex_handling ({'none', 'remove', 'expand', 'link'}, optional (default='link')) –

    How to deal with protein-complex entities (OmniPath encodes complexes as underscore-concatenated member gene symbols, e.g. PROTEIN__AEBP2_EED_EZH2_RBBP4_SUZ12 for PRC2):

    • 'none' - leave complex nodes untouched (legacy behaviour, kept for backwards compatibility).

    • 'remove' - drop every edge that involves a complex.

    • 'expand' - replace each complex with its constituent members, fanning out one edge per member. This recovers gene-level coverage at the cost of introducing approximate member-level edges that were not literally curated.

    • 'link' - rename complex nodes into a dedicated COMPLEX__ namespace and add explicit PROTEIN__<member> -> COMPLEX__<...> “assembly” edges, so the GSNN can learn complex activity from member activity while preserving the unit-level semantics of the curated interaction.

  • min_n_references (int or None, optional (default=None)) – If set, retain only edges supported by at least this many PubMed references (OmniPath n_references field). Datasets that do not expose the column are dropped entirely when this filter is active.

  • min_curation_effort (int or None, optional (default=None)) – If set, retain only edges whose OmniPath curation-effort score is at least this value. Datasets that do not expose the column are dropped entirely when this filter is active.

  • return_uniprot_map (bool, optional (default=False)) – If True, return a third element: the many-to-many UniProt mapping table produced by build_uniprot_symbol_map().

  • verbose (bool, optional (default=True)) – Whether to print progress updates.
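To make the complex_handling modes more concrete, here is a small pandas sketch of 'remove', 'expand' and 'link' on a toy edge list. It is an illustration under assumptions, not the package's code: the set of complex node ids is given explicitly (the source does not say how complexes are detected), the helper names are hypothetical, and edge_type handling is omitted.

```python
import pandas as pd

edges = pd.DataFrame({
    "src": ["PROTEIN__EED_EZH2_SUZ12", "PROTEIN__TP53"],
    "dst": ["PROTEIN__CDKN2A", "PROTEIN__MDM2"],
})
# Assumption: the complex node ids are known up front.
complexes = {"PROTEIN__EED_EZH2_SUZ12"}

def members(node):
    # 'PROTEIN__A_B' -> ['A', 'B']
    return node.split("__", 1)[1].split("_")

def remove(edges, complexes):
    # Drop every edge that touches a complex.
    mask = edges["src"].isin(complexes) | edges["dst"].isin(complexes)
    return edges[~mask].reset_index(drop=True)

def expand(edges, complexes):
    # Fan out one edge per complex member.
    out = edges.copy()
    for col in ("src", "dst"):
        out[col] = out[col].map(
            lambda n: [f"PROTEIN__{m}" for m in members(n)] if n in complexes else [n]
        )
        out = out.explode(col).reset_index(drop=True)
    return out.drop_duplicates().reset_index(drop=True)

def link(edges, complexes):
    # Move complexes to a COMPLEX__ namespace and add member -> complex edges.
    rename = {c: "COMPLEX__" + c.split("__", 1)[1] for c in complexes}
    out = edges.copy()
    for col in ("src", "dst"):
        out[col] = out[col].map(lambda n: rename.get(n, n))
    assembly = pd.DataFrame(
        [(f"PROTEIN__{m}", rename[c]) for c in sorted(complexes) for m in members(c)],
        columns=["src", "dst"],
    )
    return pd.concat([out, assembly], ignore_index=True)

removed, expanded, linked = (f(edges, complexes) for f in (remove, expand, link))
print(removed, expanded, linked, sep="\n\n")
```

On this input 'remove' keeps one edge, 'expand' yields four member-level edges, and 'link' yields the two curated edges plus three assembly edges.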

Returns:

  • pandas.DataFrame – One row per function-graph node with columns ['func_name', 'uniprot', 'gene_symbol']. func_name is the prefixed node id (e.g. PROTEIN__TP53); gene_symbol is the suffix after __; uniprot is the last-seen OmniPath accession for that node.

  • pandas.DataFrame – DataFrame with columns ['src', 'dst', 'edge_type', 'source_uniprot', 'target_uniprot'] describing the directed interaction graph. source_uniprot and target_uniprot are the OmniPath source / target accession columns (typically UniProt; complexes may use COMPLEX:...; miRNA resources may use miRBase IDs). A handful of synthetic assembly edges may lack a member-level source_uniprot.

  • pandas.DataFrame, optional – Returned only when return_uniprot_map=True. The many-to-many UniProt ↔ gene-symbol ↔ func_name mapping table from build_uniprot_symbol_map().

Notes

The function prints the number of automatically generated translation edges. Depending on the local cache state, the first call may take a few seconds because the interaction tables are lazily downloaded from the OmniPath server.

Duplicate edges are collapsed on (src, dst, edge_type) only; if multiple accessions map to the same symbol after alias stripping, the first row is kept.

Several miRNA-related tables surface miRBase legacy alias strings of the form "HSA-MIR-675B;HSA-MIR-675*" – multiple historical names of the same mature miRNA joined by ';'. Only the first alias is retained as the canonical node identifier.
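The alias stripping and the duplicate collapse described in these notes can be sketched with plain pandas. The edge_type labels and the miRNA accessions below are toy values chosen for the illustration, and the order of the two steps is an assumption based on the wording "after alias stripping".

```python
import pandas as pd

edges = pd.DataFrame({
    "src": ["RNA__HSA-MIR-675B;HSA-MIR-675*", "RNA__HSA-MIR-675B", "PROTEIN__TP53"],
    "dst": ["RNA__PTEN", "RNA__PTEN", "RNA__MDM2"],
    "edge_type": ["mirna_target", "mirna_target", "tf_target"],   # toy labels
    "source_uniprot": ["MIMAT_X", "MIMAT_Y", "P04637"],           # toy miRNA accessions
})

# Keep only the first ';'-separated alias as the canonical node identifier.
for col in ("src", "dst"):
    edges[col] = edges[col].str.split(";", n=1).str[0]

# Collapse duplicates on the key columns only; the first row wins, so the
# accession columns of later duplicates are discarded.
deduped = edges.drop_duplicates(subset=["src", "dst", "edge_type"], keep="first")
print(deduped)
```

After stripping, the first two rows share the same key, so only the first one (with its accession) survives.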

Examples

>>> from gsnn.proc.bio import get_bio_interactions
>>> nodes, edges = get_bio_interactions(undirected=True, include_tf_mirna=True)
>>> nodes.columns.tolist(), edges.shape

gsnn.gsnn.proc.bio.symbol2uniprot(gene_symbols, allow='1:m', drop_na=True)

Map gene symbols to UniProt accession IDs using PyPath.

A convenience wrapper around pypath.utils.mapping.map_name that translates gene symbols into their corresponding UniProt accession IDs.

Two mapping strategies are available (allow):

  1. '1:m' - keep all UniProt IDs associated with a gene symbol (one-to-many, default).

  2. '1:1' - keep only the first UniProt ID returned by PyPath for each gene symbol (one-to-one).

Parameters:
  • gene_symbols (Sequence[str] or pandas.Series) – Iterable of gene symbols. Duplicate symbols are collapsed to the unique set for the lookup, but the returned DataFrame contains one row per combination of symbol and UniProt ID.

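  • allow (str, optional) – Mapping strategy; must be either '1:m' or '1:1'. Defaults to '1:m'.

  • drop_na (bool, optional) – If True, drop rows where the gene symbol could not be mapped to a UniProt ID. Defaults to True.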

Returns:

A two-column DataFrame with

  • 'gene_symbol' - Gene symbol (str)

  • 'uniprot_id' - UniProt accession (str) or None if the symbol could not be mapped.

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from gsnn.proc.bio import symbol2uniprot
>>> symbols = pd.Series(['MAPK1', 'PTEN', 'INVALID'])
>>> symbol2uniprot(symbols, allow='1:m', drop_na=False).head()
   gene_symbol uniprot_id
0        MAPK1     P28482
1         PTEN     P60484
2      INVALID       None

gsnn.gsnn.proc.bio.uniprot2symbol(uniprot_ids, allow='1:m', drop_na=True)

Map UniProt accession IDs to HGNC gene symbols using PyPath.

A convenience wrapper around pypath.utils.mapping.map_name that translates UniProt accession IDs into their corresponding HGNC gene symbols.

Two mapping strategies are available (allow):

  1. '1:m' - keep all gene symbols associated with a UniProt ID (one-to-many, default).

  2. '1:1' - keep only the first gene symbol returned by PyPath for each UniProt ID (one-to-one).

Parameters:
  • uniprot_ids (Sequence[str] or pandas.Series) – Iterable of UniProt accession IDs. Duplicate IDs are collapsed to the unique set for the lookup, but the returned DataFrame contains one row per combination of accession and gene symbol.

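  • allow (str, optional) – Mapping strategy; must be either '1:m' or '1:1'. Defaults to '1:m'.

  • drop_na (bool, optional) – If True, drop rows where the UniProt ID could not be mapped to a gene symbol. Defaults to True.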

Returns:

A two-column DataFrame with

  • 'uniprot_id' - UniProt accession (str)

  • 'gene_symbol' - Gene symbol (str) or None if the accession could not be mapped.

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from gsnn.proc.bio import uniprot2symbol
>>> ids = pd.Series(['P28482', 'P60484', 'INVALID'])
>>> uniprot2symbol(ids, allow='1:m', drop_na=False).head()
   uniprot_id gene_symbol
0      P28482       MAPK1
1      P60484        PTEN
2     INVALID        None