Data

Supplementary material for "Evolution of sequence-diverse disordered regions in a protein family: order within the chaos"

La Trobe University
Kim Johnson, Thomas Shafee, Tony Bacic (aggregated by)
DOI: https://doi.org/10.26181/11775024.v1
Subjects: Fasciclin-like arabinogalactan proteins; Bioinformatics

Licence & Rights:


CC-BY-4.0

Full description

A set of supplementary data files

Accompanying publication: Evolution of sequence-diverse disordered regions in a protein family: order within the chaos


Supp data file 1

Excel file listing the 2644 fasciclin domains with their names and annotation information. To keep names short for phylogenies, FLAs are given arbitrary identifier numbers, and the fasciclin domains within them are indicated by their position (e.g. “>X1234_FLA.2.3” -> Fasciclin domain cluster 1, arbitrary FLA identifier number 1234, FLA fasciclin domain 2 out of 3). Numbers and colours are given for the fasciclin, AG, non-AG and inter-proline clusters.
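The naming example above can be unpacked programmatically. The following is a minimal sketch that parses only the literal form shown in the example (“>X1234_FLA.2.3”); the regular expression, the `DomainName` type and the `parse_domain_name` function are hypothetical helpers, not part of the dataset, and the names actually used in the files may follow the `G[fas.clust] X[number] ...` construction described under Fields instead.

```python
import re
from typing import NamedTuple


class DomainName(NamedTuple):
    """Pieces of a short domain name such as 'X1234_FLA.2.3' (hypothetical type)."""
    fla_number: int    # arbitrary FLA identifier number
    domain_index: int  # which fasciclin domain this is within the FLA
    domain_total: int  # how many fasciclin domains the FLA has


# Assumed pattern, taken only from the single example in the description.
_NAME_RE = re.compile(r"^>?X(\d+)_FLA\.(\d+)\.(\d+)$")


def parse_domain_name(name: str) -> DomainName:
    """Split a short domain name into FLA number, domain index and domain total."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised domain name: {name!r}")
    number, index, total = (int(group) for group in match.groups())
    if not 1 <= index <= total:
        raise ValueError(f"domain index {index} outside 1..{total} in {name!r}")
    return DomainName(number, index, total)


print(parse_domain_name(">X1234_FLA.2.3"))
```

The leading ">" is treated as optional so that the same helper works on FASTA header lines and on bare names.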

Fields:

  • name = sequence name (constructed as: G[fas.clust] X[number] fas [fas.count] of [fas.max] ; also used in alignments and phylogenies).
  • number = Arbitrary ID number for the FLA sequence
  • Accession = Phytosome gene sequence ID for the FLA sequence
  • fas.count = Position of this fasciclin domain within the FLA sequence
  • fas.max = Total number of fasciclin domains in the FLA sequence
  • fas.clust(PCA) = Initial cluster based on PCA+MClust of fasciclin domain sequence
  • fas.clust = Cluster based on UMAP+HDBSCAN of fasciclin domain sequence (0=no cluster assigned, 1=type A, 2=type B, etc.)
  • agreg.clust = Cluster based on UMAP+HDBSCAN of arabinogalactan regions (0=no cluster assigned, 1=type a, 2=type b, etc.)
  • nagreg.clust = Cluster based on UMAP+HDBSCAN of non-arabinogalactan non-fasciclin regions (0=no cluster assigned, 1=type a, 2=type b, etc.)
  • interP.clust = Cluster based on UMAP+HDBSCAN of inter-proline distance (0=no cluster assigned, 1=type a, 2=type b, etc.)
  • genus & species = Taxonomy of the organism containing the sequence
  • tax.name = Broad taxonomic group of organism containing the sequence (not necessarily a monophyletic group)
  • [x].col = colour used in diagrams for sequences in that cluster
  • ngly.site.[x] = Boolean (true/false) indicating whether the sequence contains an N-glycosylation motif at that position. Includes a number to indicate the domain within a FLA with two fasciclin domains (see figs 2 & S8 for positions)
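The integer cluster codes in the fields above map onto letters (0 = no cluster assigned, 1 = type A or a, 2 = type B or b, and so on). A minimal sketch of that decoding, assuming the codes continue alphabetically; the `cluster_label` function and the "unassigned" label are hypothetical and not taken from the dataset:

```python
import string


def cluster_label(code: int, uppercase: bool = True) -> str:
    """Convert an integer cluster code to its letter label.

    uppercase=True matches the fas.clust convention (type A, B, ...);
    uppercase=False matches agreg.clust, nagreg.clust and interP.clust
    (type a, b, ...). Code 0 means no cluster was assigned.
    """
    if code == 0:
        return "unassigned"  # hypothetical label for code 0
    if not 1 <= code <= 26:
        raise ValueError(f"cluster code out of range: {code}")
    letters = string.ascii_uppercase if uppercase else string.ascii_lowercase
    return letters[code - 1]


print(cluster_label(1), cluster_label(2, uppercase=False), cluster_label(0))
```

Under this assumption, the fasciclin clusters A-R mentioned for the alignment and phylogeny files correspond to codes 1 to 18.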

Supp data file 2

Multiple sequence alignments as FASTA files for all 2644 fasciclin domains, and separately for each of the clusters A-R.

Naming:

  • Sequence names = sequence name (constructed as: G[fas.clust] X[number] fas [fas.count] of [fas.max] ; see supp data file 1 fields)
  • File names = Cluster based on UMAP+HDBSCAN of fasciclin domain sequence (0=no cluster assigned, 1=type A, 2=type B, etc.)
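The alignment files can be read with any FASTA parser. As a rough illustration, the sketch below uses a small hand-written reader (not a library API) and checks that all sequences have equal length, as expected for an alignment. The record names and residues in `example` are toy values made up for the demonstration, not data from the files.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {name: sequence} dict, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name!r}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name] += line
    return records


def is_aligned(records: Dict[str, str]) -> bool:
    """True if every sequence (gaps included) has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy input: two made-up records, each wrapped over two lines.
example = """>X1234_FLA.2.3
MK-LA
TT
>X5678_FLA.1.1
MKALA
-T
""".splitlines()

alignment = read_fasta(example)
print(len(alignment), is_aligned(alignment))
```

To read a real file, pass an open file handle instead of `example`, since iterating over a handle yields lines in the same way.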


Supp data file 3

Phylogenies as Newick files for all 2644 fasciclin domains, and separately for each of the clusters A-R.

Naming:

  • Sequence names = sequence name (constructed as: G[fas.clust] X[number] fas [fas.count] of [fas.max] ; see supp data file 1 fields)
  • File names = Cluster based on UMAP+HDBSCAN of fasciclin domain sequence (0=no cluster assigned, 1=type A, 2=type B, etc.)
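Because the tip labels in the phylogenies are the sequence names, they can be matched back to the rows of supp data file 1. A minimal sketch of pulling tip labels out of a Newick string with a regular expression follows; the `leaf_names` function is a hypothetical helper that assumes unquoted labels and a tree with at least one pair of parentheses, and the tree below is a made-up example rather than one from the dataset. A dedicated phylogenetics library is a better choice for anything beyond a quick check.

```python
import re
from typing import List

# A tip label directly follows "(" or "," and runs until the next
# structural character. Labels after ")" (internal node names or
# support values) are deliberately not matched.
_LEAF_RE = re.compile(r"[(,]\s*([^(),:;]+)")


def leaf_names(newick: str) -> List[str]:
    """Return tip labels in the order they appear in the Newick string."""
    candidates = (label.strip() for label in _LEAF_RE.findall(newick))
    return [label for label in candidates if label]


# Toy tree with made-up names and branch lengths.
tree = "((X1_FLA.1.2:0.1,X1_FLA.2.2:0.2)95:0.3,X2_FLA.1.1:0.4);"
print(leaf_names(tree))
```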

Supp data file 4

An R script to perform the analyses shown in the publication. See also the GitHub repository TS404/FLAnnotator.



Issued: 2020-08-01

Created: 2024-12-02

This dataset is part of a larger collection
