pygenprop package

Submodules

pygenprop.assign module

Created by: Lee Bergstrand (2017)

Description: Functions for assigning genome properties.

class pygenprop.assign.AssignmentCache(interpro_member_database_identifiers: list = None, sample_name=None)[source]

Bases: object

This class contains a representation of precomputed assignment results and InterPro member database matches.

cache_property_assignment(genome_property_identifier: str, assignment: str)[source]

Stores cached assignment results for a genome property.

Parameters:
  • genome_property_identifier – The identifier of the genome property.
  • assignment – An assignment of YES, NO or PARTIAL for the given genome property.
cache_step_assignment(genome_property_identifier: str, step_number: int, assignment: str)[source]

Stores cached assignment results for a genome property step.

Parameters:
  • genome_property_identifier – The identifier of the genome property to which the step belongs.
  • step_number – The step's number.
  • assignment – An assignment of YES or NO for the given step.
flush_property_from_cache(genome_property_identifier)[source]

Remove a genome property from the cache using its identifier.

Parameters:genome_property_identifier – The identifier of the property to remove from the cache.
genome_property_identifiers

Creates a set of identifiers belonging to the genome properties cached.

Returns:A set of genome property identifiers.
get_property_assignment(genome_property_identifier)[source]

Retrieves cached assignment results for a genome property.

Parameters:genome_property_identifier – The identifier of the genome property.
Returns:An assignment of YES, NO or PARTIAL for the given genome property.
get_step_assignment(genome_property_identifier: str, step_number: int)[source]

Retrieves cached assignment results for a genome property step.

Parameters:
  • genome_property_identifier – The identifier of the genome property to which the step belongs.
  • step_number – The step's number.
Returns:

An assignment of YES or NO for the given step.

pygenprop.assign.assign_evidence(assignment_cache: pygenprop.assign.AssignmentCache, current_evidence: pygenprop.evidence.Evidence)[source]

Assigns a result (YES, NO) to an evidence based on the presence or absence of InterPro member identifiers or the assignment of the evidence's child genome properties.

Parameters:
  • assignment_cache – A cache containing step and property assignments and InterPro member database matches.
  • current_evidence – The current evidence which needs assignment.
Returns:

The assignment for the evidence.

pygenprop.assign.assign_functional_element(assignment_cache: pygenprop.assign.AssignmentCache, functional_element: pygenprop.functional_element.FunctionalElement)[source]

Assigns a result (YES, NO) to a functional element based on assignments of its evidences.

Parameters:
  • assignment_cache – A cache containing step and property assignments and InterPro member database matches.
  • functional_element – The current functional_element which needs assignment.
Returns:

The assignment for the functional element.

pygenprop.assign.assign_genome_property(assignment_cache: pygenprop.assign.AssignmentCache, genome_property: pygenprop.genome_property.GenomeProperty)[source]

Recursively assigns a result to a genome property and its children.

Parameters:
  • assignment_cache – A cache containing step and property assignments and InterPro member database matches.
  • genome_property – The genome property to assign the results to.
Returns:

The assignment results for the genome property.

pygenprop.assign.assign_step(assignment_cache: pygenprop.assign.AssignmentCache, step: pygenprop.step.Step)[source]

Assigns a result (YES, NO) to a step based on assignments of its functional elements.

Parameters:
  • assignment_cache – A cache containing step and property assignments and InterPro member database matches.
  • step – The current step element which needs assignment.
Returns:

The assignment for the step.

pygenprop.assign.calculate_property_assignment_from_all_steps(child_assignments: list)[source]

Takes the assignment results from all children and uses them to assign a result for the parent itself. This algorithm is used to assign results to a single step from child functional elements and for genome properties that have no required steps, such as “category” type genome properties. This is a more generic version of the algorithm used in calculate_property_assignment_from_required_steps().

If all child assignments are NO, the parent should be NO. If all child assignments are YES, the parent should be YES. For anything in between, the parent should be PARTIAL.

Parameters:child_assignments – A list of assignment results for child steps or genome properties.
Returns:The parent’s assignment result.
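The rule above can be sketched in a few lines. This is a minimal illustration rather than the library's implementation: the function name sketch_assignment_from_all_children is hypothetical, assignments are assumed to be the uppercase strings 'YES', 'NO' and 'PARTIAL', and the handling of an empty list is an assumption.

```python
def sketch_assignment_from_all_children(child_assignments):
    """Hypothetical helper illustrating the all-children rule."""
    total = len(child_assignments)

    if total == 0:
        return 'NO'  # Assumption: no children means no support.
    if child_assignments.count('NO') == total:
        return 'NO'  # Every child is NO.
    if child_assignments.count('YES') == total:
        return 'YES'  # Every child is YES.
    return 'PARTIAL'  # Any other mixture.


print(sketch_assignment_from_all_children(['YES', 'NO', 'YES']))
```

Note that a list containing only PARTIAL children also resolves to PARTIAL under this rule, because it is neither all YES nor all NO.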
pygenprop.assign.calculate_property_assignment_from_required_steps(required_step_assignments: list, threshold: int = 0)[source]

Takes the assignment results for each required step of a genome property and uses them to assign a result for the property itself. This is the classic algorithm used by EBI Genome Properties.

From: https://genome-properties.readthedocs.io/en/latest/calculating.html

To determine if the GP resolves to a YES (all required steps are present), NO (too few required steps are present) or PARTIAL (the number of required steps present is greater than the threshold, indicating that some evidence of the presence of the GP can be assumed).

Child steps must be present (‘YES’) not partial.

In Perl code for Genome Properties:

Link: https://github.com/ebi-pf-team/genome-properties/blob/a76a5c0284f6c38cb8f43676618cf74f64634d33/code/pygenprop/GenomeProperties.pm#L646

    # Three possible results for the evaluation
    if($found == 0 or $found <= $def->threshold){
        $def->result('NO'); # No required steps found
    }elsif($missing){
        $def->result('PARTIAL'); # One or more required steps found, but one or more required steps missing
    }else{
        $def->result('YES'); # All steps found.
    }

If no required steps are found, or the number found is less than or equal to the threshold –> NO
Else if any are missing –> PARTIAL
Else (none are missing) –> YES

So for problem space ALL_PRESENT > THRESHOLD > NONE_PRESENT:

YES when ALL_PRESENT = CHILD_YES_COUNT
PARTIAL when CHILD_YES_COUNT > THRESHOLD
NO when CHILD_YES_COUNT <= THRESHOLD

Parameters:
  • required_step_assignments – A list of assignment results for child steps or genome properties.
  • threshold – The threshold of ‘YES’ assignments necessary for a ‘PARTIAL’ assignment.
Returns:

The parent’s assignment result.
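The threshold rule quoted above translates fairly directly into Python. The following is a sketch that mirrors the order of the Perl checks, not the package's actual code; the function name sketch_assignment_from_required_steps is hypothetical, and only 'YES' children are counted as present, as stated above.

```python
def sketch_assignment_from_required_steps(required_step_assignments, threshold=0):
    """Hypothetical helper mirroring the Perl evaluation order."""
    # Only steps assigned YES count as found; PARTIAL does not.
    found = required_step_assignments.count('YES')
    missing = len(required_step_assignments) - found

    if found == 0 or found <= threshold:
        return 'NO'  # Too few required steps found.
    if missing:
        return 'PARTIAL'  # Above the threshold, but some steps missing.
    return 'YES'  # All required steps found.


print(sketch_assignment_from_required_steps(['YES', 'YES', 'NO'], threshold=1))
```

Because the threshold check comes first, a property whose threshold is greater than or equal to its number of required steps resolves to NO in this sketch even when every step is YES.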

pygenprop.assign.calculate_step_or_functional_element_assignment(child_assignments: list, sufficient_scheme=False)[source]

Assigns a step result or functional element result based on the assignments of its children. In the case of steps, the children are functional element assignments. In the case of functional elements, the children are evidence assignments.

For assignments from child genome properties YES or PARTIAL is considered YES.

See: https://github.com/ebi-pf-team/genome-properties/blob/a76a5c0284f6c38cb8f43676618cf74f64634d33/code/modules/GenomeProperties.pm#L686

    if($evObj->gp){
        if(defined($self->get_defs->{ $evObj->gp })){
            # For properties a PARTIAL or YES result is considered success
            if( $self->get_defs->{ $evObj->gp }->result eq 'YES' or
                $self->get_defs->{ $evObj->gp }->result eq 'PARTIAL' ){
                $succeed++;
            }elsif($self->get_defs->{ $evObj->gp }->result eq 'UNTESTED'){
                $step->evaluated(0);

Parameters:
  • sufficient_scheme – If false, any child NO means NO. If true, any child YES or PARTIAL means YES.
  • child_assignments – A list containing strings of YES, NO or PARTIAL
Returns:

The assignment as either YES or NO.
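The two schemes described by sufficient_scheme amount to an "all" test and an "any" test over the children. A minimal sketch follows, with a hypothetical function name; returning NO for an empty list of children is an assumption.

```python
def sketch_step_or_element_assignment(child_assignments, sufficient_scheme=False):
    """Hypothetical helper illustrating the two evaluation schemes."""
    # YES and PARTIAL both count as success for a child.
    successes = [assignment in ('YES', 'PARTIAL') for assignment in child_assignments]

    if not successes:
        return 'NO'  # Assumption: no children means no support.
    if sufficient_scheme:
        # One successful child is enough.
        return 'YES' if any(successes) else 'NO'
    # Otherwise every child must succeed; any NO means NO.
    return 'YES' if all(successes) else 'NO'


print(sketch_step_or_element_assignment(['YES', 'NO'], sufficient_scheme=True))
```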

pygenprop.assignment_file_parser module

Created by: Lee Bergstrand (2017)

Description: A parser for parsing genome properties longform files.

pygenprop.assignment_file_parser.parse_genome_property_longform_file(longform_file)[source]

Parses longform genome properties assignment files.

Parameters:longform_file – A longform genome properties assignment file handle object.
Returns:An assignment cache object.
pygenprop.assignment_file_parser.parse_interproscan_file(interproscan_file)[source]

Parses InterProScan TSV files into an assignment cache.

Parameters:interproscan_file – An InterProScan file handle object.
Returns:An assignment cache object.

pygenprop.database_file_parser module

Created by: Lee Bergstrand (2017)

Description: A parser for parsing genome properties flat files into a rooted DAG of genome properties.

pygenprop.database_file_parser.create_marker_and_content(genome_property_flat_file_line)[source]

Splits a line from a genome property flat file into a marker, content pair.

Parameters:genome_property_flat_file_line – A line from a genome property flat file.
Returns:A tuple containing a marker, content pair.
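As a rough illustration of the split, the sketch below assumes that each line starts with a short marker separated from its content by whitespace. The function name and the sample line are hypothetical and are not taken from the package.

```python
def sketch_marker_and_content(line):
    """Hypothetical helper: split a flat file line into (marker, content)."""
    # Everything before the first space is treated as the marker.
    marker, _, content = line.strip().partition(' ')
    return marker, content.strip()


print(sketch_marker_and_content('AC  GenProp0001\n'))
```

A line that holds only a marker yields an empty content string.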
pygenprop.database_file_parser.extract_identifiers(identifier_string)[source]

Parse database or Genprop identifiers from an EV or TG tag content string.

Parameters:identifier_string – The content string from an EV or TG tag.
Returns:A list of identifiers.
pygenprop.database_file_parser.parse_database_references(genome_property_record)[source]

Parses database references from a genome properties record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A list of DatabaseReference objects.
pygenprop.database_file_parser.parse_evidences(genome_property_record)[source]

Parses evidences from a genome properties record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A list of evidence objects.
pygenprop.database_file_parser.parse_functional_elements(genome_property_record)[source]

Parses functional_elements from a genome properties record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A list of functional_element objects.
pygenprop.database_file_parser.parse_genome_properties_flat_file(genome_property_file)[source]

Parses a genome property flat file.

Parameters:genome_property_file – A genome property file handle object.
Returns:A GenomePropertiesTree object.
pygenprop.database_file_parser.parse_genome_property(genome_property_record)[source]

Parses a single genome property from a genome property record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A single genome property object.
pygenprop.database_file_parser.parse_literature_references(genome_property_record)[source]

Parses literature references from a genome properties record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A list of LiteratureReference objects.
pygenprop.database_file_parser.parse_single_evidence(current_evidence_dictionary)[source]

Creates an Evidence object from a pair of EV and TG tag content strings.

Parameters:current_evidence_dictionary – A dictionary containing EV and TG to content string mappings.
Returns:An Evidence object.
pygenprop.database_file_parser.parse_steps(genome_property_record)[source]

Parses steps from a genome properties record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A list of Step objects.
pygenprop.database_file_parser.unwrap_genome_property_record(genome_property_record)[source]

The standard genome property record wraps every 80 characters. This function unwraps the record.

Parameters:genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns:A list of reduced-redundancy marker, content tuples representing genome property flat file lines. Consecutive identical markers (often ‘CC’ and ‘**’) are collapsed into one tuple.
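The collapsing step can be pictured as merging runs of tuples that share a marker. The sketch below is an illustration under stated assumptions: the function name is hypothetical, wrapped content is joined with a single space, and every run of identical markers is merged, whereas a real parser would likely need to exempt markers that legitimately repeat.

```python
def sketch_unwrap_record(genome_property_record):
    """Hypothetical helper: merge consecutive tuples that share a marker."""
    unwrapped = []
    for marker, content in genome_property_record:
        if unwrapped and unwrapped[-1][0] == marker:
            # Same marker as the previous tuple: append to its content.
            previous_marker, previous_content = unwrapped[-1]
            unwrapped[-1] = (previous_marker, previous_content + ' ' + content)
        else:
            unwrapped.append((marker, content))
    return unwrapped


record = [('CC', 'first part'), ('CC', 'second part'), ('AC', 'identifier')]
print(sketch_unwrap_record(record))
```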

pygenprop.database_reference module

Created by: Lee Bergstrand (2017)

Description: The database reference class.

class pygenprop.database_reference.DatabaseReference(database_name, record_title, record_ids)[source]

Bases: object

A class representing an external database reference for a genome property.

pygenprop.evidence module

Created by: Lee Bergstrand (2017)

Description: The evidence class.

class pygenprop.evidence.Evidence(evidence_identifiers=None, gene_ontology_terms=None, sufficient=False, parent: pygenprop.functional_element.FunctionalElement = None)[source]

Bases: object

A piece of evidence (e.g. an InterPro HMM hit or a GenProp) that supports the existence of a functional element.

consortium_identifiers
Gets the InterPro consortium signature identifiers (PFAM, TIGRFAM, etc.) representing a piece of evidence.

Returns:A set of consortium signature identifiers.
genome_properties

Get genome properties that are used by this evidence.

Returns:A list of genome properties.
genome_property_identifiers

Gets the genome property identifiers representing a piece of evidence.

Returns:A list of genome property identifiers.
has_genome_property

Is the evidence a genome property?

Returns:True if the evidence is a genome property.
interpro_identifiers

Gets the InterPro (IPRXXXXXX) identifiers representing a piece of evidence.

Returns:A list of InterPro identifiers.

pygenprop.functional_element module

Created by: Lee Bergstrand (2017)

Description: The functional element class.

class pygenprop.functional_element.FunctionalElement(identifier, name, evidence: list = None, required=False, parent: pygenprop.step.Step = None)[source]

Bases: object

A functional element (enzyme, structural component or sub-genome property) that can carry out a step.

pygenprop.genome_property module

Created by: Lee Bergstrand (2017)

Description: The genome property class.

class pygenprop.genome_property.GenomeProperty(accession_id, name, property_type, threshold=0, parents=None, children=None, references=None, databases=None, steps=None, public=True, description=None, private_notes=None, tree=None)[source]

Bases: object

Represents an EBI genome property. Each genome property represents a specific capability of an organism, as proven by the presence of genes found in its genome.

child_genome_property_identifiers

Collects the genome property identifiers of child genome properties.

Returns:A list of genome property identifiers.
required_steps

Returns a list of all the required steps of the genome property.

Returns:All required steps as a list.
to_json(as_dict=False)[source]

Converts the object to a JSON representation.

Parameters:as_dict – Return a dictionary for incorporation into other json objects.
Returns:A JSON formatted string or dictionary representing the object.

pygenprop.lib module

Created by: Lee Bergstrand (2017)

Description: A set of helper functions.

pygenprop.lib.sanitize_cli_path(cli_path)[source]

Performs expansion of ‘~’ and shell variables such as “$HOME” into absolute paths.

Parameters:cli_path – The path to expand.
Returns:An expanded path.
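This kind of expansion can be done with the standard library alone. The sketch below is not the package's implementation; the function name is hypothetical, and the final os.path.abspath call is an assumption about how the path is made absolute.

```python
import os


def sketch_sanitize_cli_path(cli_path):
    """Hypothetical helper: expand '~' and shell variables in a path."""
    # Expand a leading '~' first, then variables such as $HOME.
    expanded = os.path.expandvars(os.path.expanduser(cli_path))
    # Resolve the result against the current working directory if needed.
    return os.path.abspath(expanded)


print(sketch_sanitize_cli_path('~/genome_properties.txt'))
```

Variables that are not defined in the environment are left unchanged by os.path.expandvars.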

pygenprop.literature_reference module

Created by: Lee Bergstrand (2017)

Description: The literature reference class.

class pygenprop.literature_reference.LiteratureReference(number, pubmed_id, title, authors, citation)[source]

Bases: object

A class representing a literature reference supporting the existence of a genome property.

pygenprop.results module

Created by: Lee Bergstrand (2018)

Description: The genome properties results class.

class pygenprop.results.GenomePropertiesResults(*genome_properties_results, properties_tree: pygenprop.tree.GenomePropertiesTree)[source]

Bases: object

This class contains a representation of a table of results from one or more genome properties assignments.

differing_property_results

Property results restricted to properties whose assignments differ in at least one sample.

Returns:A property result data frame where properties with the same value in all samples are filtered out.

differing_step_results

Step results restricted to steps whose assignments differ in at least one sample.

Returns:A step result data frame where steps with the same value in all samples are filtered out.

generate_json_tree(genome_properties_root)[source]

Creates a tree based representation of the genome properties assignment results.

Parameters:genome_properties_root – The root element of the genome properties tree.
Returns:A nested dict of assignment results.
get_property_result(genome_property_id)[source]

Gets the assignment results for a given genome property.

Parameters:genome_property_id – The id of the genome property to get results for.
Returns:A list containing the assignment results for the genome property in question.
get_results(*property_identifiers, steps=False, names=False)[source]

Creates a results dataframe for only a subset of genome properties.

Parameters:
  • property_identifiers – The id of one or more genome properties to get results for.
  • steps – Add steps to the dataframe.
  • names – Add property and or step names to the dataframe.
Returns:

A dataframe with results for a specific set of genome properties.

get_results_summary(*property_identifiers, steps=False, normalize=False)[source]

Creates a summary table of YES, NO and PARTIAL assignments for a given set of properties or property steps. Displays counts or percentages of YES, NO and PARTIAL assignments for the given properties or their steps.

Parameters:
  • property_identifiers – The id of one or more genome properties to get results for.
  • steps – Summarize results for the steps of the input properties
  • normalize – Display the summary as a percent rather than as counts.
Returns:

A summary table dataframe

get_step_name(property_identifier, step_number)[source]

Helper function to quickly acquire a property step's name.

Parameters:
  • property_identifier – The id of the genome property.
  • step_number – The step number of the step.
Returns:

The step's name.

get_step_result(genome_property_id, step_number)[source]

Gets the assignment results for a given step of a genome property.

Parameters:
  • genome_property_id – The id of the genome property that the step belongs to.
  • step_number – The step number of the step.
Returns:

A list containing the assignment results for the step in question.

static remove_results_with_shared_assignments(results, only_drop_no_assignments=False)[source]

Filters out results where all samples have the same value.

Parameters:
  • results – A step or property results data frame.
  • only_drop_no_assignments – Only drop results where values are all NO.
Returns:

A step or property data frame with certain properties filtered out.
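The filtering logic can be shown without a data frame. In the sketch below, a plain dict that maps an identifier to a list of per-sample assignments stands in for the results table; this stand-in, the function name and the identifiers are all hypothetical.

```python
def sketch_remove_shared_assignments(results, only_drop_no_assignments=False):
    """Hypothetical helper: drop rows whose samples all share one value."""
    filtered = {}
    for identifier, assignments in results.items():
        unique_values = set(assignments)
        if only_drop_no_assignments:
            # Drop only rows that are NO in every sample.
            drop = unique_values == {'NO'}
        else:
            # Drop rows where every sample has the same assignment.
            drop = len(unique_values) <= 1
        if not drop:
            filtered[identifier] = assignments
    return filtered


table = {
    'property_a': ['YES', 'YES'],
    'property_b': ['YES', 'NO'],
    'property_c': ['NO', 'NO'],
}
print(sketch_remove_shared_assignments(table))
```

With only_drop_no_assignments=True, property_a would be kept as well, since it is supported in at least one sample.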

supported_property_results

Property results where properties that are not supported in any sample are removed.

Returns:A property result data frame where properties with all NO values are filtered out.

supported_step_results

Step results where steps that are not supported in any sample are removed.

Returns:A step result data frame where steps with all NO values are filtered out.

to_json(file_handle=None)[source]

Returns a JSON representation of the step results.

Returns:A nested dict of the assignment results and sample names.

pygenprop.results.bootstrap_assignments(assignment_cache, genome_properties_tree)[source]

Recursively fills in assignments for all genome properties in the genome properties tree based on existing cached assignments and InterPro member database identifiers.

Parameters:
  • assignment_cache – A cache containing step and property assignments and InterPro member database matches.
  • genome_properties_tree – The global genome properties tree.
Returns:

pygenprop.results.create_assignment_tables(genome_properties_tree: pygenprop.tree.GenomePropertiesTree, assignment_cache: pygenprop.assign.AssignmentCache)[source]

Takes a results dictionary from the longform parser and creates two tables: one for property results and one for step results. The longform results file has only leaf assignment results, so the rest have to be bootstrapped.

Parameters:
  • genome_properties_tree – The global genome properties tree.
  • assignment_cache – Per-sample genome properties results from the long form parser.
Returns:

A tuple containing a property assignment table and a step assignment table.

pygenprop.results.create_step_table_rows(step_assignments)[source]

Unfolds a step result dict of dict and yields a step table row.

Parameters:step_assignments – A dict of dicts containing step assignment information ({gp_key -> {stp_key -> result}}).
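Unfolding the nested dict can be written as a small generator. In this sketch the function name is hypothetical, and the row shape, a (property key, step key, result) tuple, is an assumption, since the row format is not specified above.

```python
def sketch_step_table_rows(step_assignments):
    """Hypothetical helper: flatten {gp_key: {stp_key: result}} into rows."""
    for property_key, steps in step_assignments.items():
        for step_key, result in steps.items():
            # One row per step of each genome property.
            yield property_key, step_key, result


nested = {'property_a': {1: 'YES', 2: 'NO'}, 'property_b': {1: 'YES'}}
for row in sketch_step_table_rows(nested):
    print(row)
```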
pygenprop.results.create_synchronized_assignment_cache(assignment_cache, genome_properties_tree)[source]

Removes genome properties from the assignment cache that are not found in both the genome properties tree and the assignment cache. This prevents situations where different versions of the cache and tree cannot find each other's genome properties.

Parameters:
  • assignment_cache – A cache containing step and property assignments and InterPro member database matches.
  • genome_properties_tree – The global genome properties tree.
Returns:

An assignment cache containing data for genome properties shared between the tree and cache.
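Conceptually, the synchronization is a set intersection of identifiers. The sketch below uses a plain dict in place of an assignment cache and a set of identifiers in place of the tree; both stand-ins and the function name are hypothetical.

```python
def sketch_synchronize(cached_assignments, tree_identifiers):
    """Hypothetical helper: keep only cached properties known to the tree."""
    shared = set(cached_assignments) & set(tree_identifiers)
    # Build a new mapping rather than mutating the input.
    return {
        identifier: assignment
        for identifier, assignment in cached_assignments.items()
        if identifier in shared
    }


cache = {'property_a': 'YES', 'property_b': 'NO', 'property_old': 'PARTIAL'}
tree_ids = {'property_a', 'property_b', 'property_new'}
print(sketch_synchronize(cache, tree_ids))
```

Returning a new mapping leaves the original cache untouched, which makes it easier to compare the two afterwards.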

pygenprop.step module

Created by: Lee Bergstrand (2017)

Description: The step class.

class pygenprop.step.Step(number, functional_elements: list = None, parent: pygenprop.genome_property.GenomeProperty = None)[source]

Bases: object

A class representing a step that supports the existence of a genome property.

genome_properties

Collects all the child genome properties supporting a step.

Returns:A list of child genome properties for a step.
genome_property_identifiers

Collects all the genome property identifiers supporting a step.

Returns:A list of the step's child genome property identifiers.
name

Gets the name of a step by combining the names of its functional elements.

Returns:The name of the step.
required

Checks if the step is required by checking if any of the functional elements are required.

Returns:True if the step is required.

pygenprop.tree module

Created by: Lee Bergstrand (2018)

Description: The genome property tree class.

class pygenprop.tree.GenomePropertiesTree(*genome_properties)[source]

Bases: object

This class contains a representation of a set of nested genome properties. Internally, the instantiated object contains a rooted DAG of genome properties connected from root to leaf (parent to child). A dictionary is also included which points to each tree node for fast lookups by genome property identifier.

build_genome_property_connections()[source]

Build connections between parent-child genome properties in the dictionary. This creates the rooted DAG.

consortium_identifiers

All InterPro consortium signature identifiers (PFAM, TIGRFAM, etc.) used by the genome properties database.

Returns:A set of all unique consortium identifiers used in genome properties.

Creates a JSON representation of genome property links.

Parameters:as_list – Return as a list instead of a JSON formatted string.
Returns:A JSON formatted string of a list of each property's JSON representation.
create_graph_nodes_json(as_list=False)[source]

Creates a JSON representation of a genome property dictionary.

Parameters:as_list – Return as a list instead of a JSON formatted string.
Returns:A JSON formatted string of a list of each property's JSON representation.
create_metabolism_database_mapping_file(file_handle)[source]

Writes a mapping file which maps each genome property to KEGG and MetaCyc.

Parameters:file_handle – A python file handle object.
create_nested_json(current_property=None, as_dict=False)[source]

Converts the object to a nested JSON representation.

Parameters:
  • current_property – The current root genome property (for recursion)
  • as_dict – Return a dictionary for incorporation into other JSON objects.
Returns:

A JSON formatted string or dictionary representing the object.

genome_property_identifiers

The identifiers of all genome properties in the database.

Returns:A set of all genome property identifiers.
get_evidence_identifiers(consortium=False)[source]

Gets evidence identifiers from all genome properties in the database.

Parameters:consortium – If true, list the consortium signature identifiers (PFAM, TIGRFAM).
Returns:A set of all unique evidence identifiers used in genome properties.
interpro_identifiers

All global InterPro identifiers (IPRXXXX, etc.) used by the genome properties database.

Returns:A set of all unique InterPro identifiers used in genome properties.
leafs

Returns the leaf nodes of the rooted DAG.

Returns:A list of all genome property objects with no children.
root

Gets the top level genome properties object in a genome properties tree.

Returns:The root genome property of the genome properties tree.
to_json(nodes_and_links=False)[source]

Converts the object to a JSON representation.

Parameters:nodes_and_links – If True, returns the JSON in node and link format.
Returns:A JSON formatted string representing the genome property tree.

Module contents