Bioinformatics Made Easy

Discover the future of bioinformatics with our cutting-edge HTML-based tools designed to enhance research efficiency and accelerate scientific breakthroughs.

Sequora dna

Sequora Phylogenomics - Small Genome Comparison Made Easy

©2025 Gregory S Muhs, All Rights Reserved
Release 3.17 - 31 March 2025 - Fixed a bug in the style code and added Toggle Switch Updates

Welcome to Sequora!

The purpose of this project is to make Bioinformatics easy, visually appealing, and enjoyable for scientists from any specialty, especially those who do not have a background in computer programming. My philosophy is that science should be fun, even for career scientists.

My premier tool is the Sequora Small Genome Phylogenomic comparison tool described below. It is web-based, so no complex software downloads are required. It has been shown to compare 6 to 12 bacterial genomes within hours by comparing "chunks" between each pair of genomes, and it ultimately creates phylogenomic trees.

My hope is that this tool can help both myself and other scientists to accelerate our research, further our understanding of the natural world, and ultimately lead to better tools and better science later on down the road.

The name Sequora comes from a Cherokee name meaning sparrow, and Matthew 10:29-31 which reads: Are not two sparrows sold for a penny? And not one of them will fall to the ground apart from your Father. But even the hairs of your head are all numbered. Fear not, therefore; you are of more value than many sparrows. (ESV Translation)

Part 1-2 - Genome FASTA Processing and CSV Export Tool

In Part 1-2, upload a FASTA file containing the genomes that you want to compare (such as bacteria, archaea or mitochondria).

You can find full genomes using the NCBI Genome Database and then combine them using my FASTA File Merging program (found right here on the Sequora website).

For tutorial and example files, click here: Primate merged (mitochondria), 12 Archaea, 26 Mycoplasma, 6 Diverse Bacteria.

Parts 1 and 2 have been combined into one step. For this step, you take one FASTA file containing several small genomes (such as prokaryotes or mitochondria). You then upload this file as the Query FASTA and then you upload it again as the Reference FASTA. This part of the program will divide each genome into "chunks" that can then be analyzed later on in the pipeline.

You then download each of the respective files which are uploaded into Part 3.

By default, the Query file is divided into 1,000 base pair chunks, and the Reference file is divided into 10,000 base pair chunks. Reference chunks are then concatenated accordingly, with the default option to combine the first, second to last, and last chunk for circularity purposes. Reverse complement chunks are handled in the same way so that the complementary sequences are not overlooked.
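As a rough illustration of the chunking idea (not the tool's actual code), the sketch below cuts a sequence into fixed-size windows and builds reverse complements. The function names and the optional `step` parameter are hypothetical, and the circularity concatenation described above is deliberately left out.

```python
# Minimal sketch, assuming plain fixed-size windows; names are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def chunk_sequence(seq, size, step=None):
    """Split seq into chunks of `size` bases; step < size gives overlapping chunks."""
    if size <= 0:
        raise ValueError("size must be positive")
    step = step or size
    return [seq[i:i + size] for i in range(0, len(seq), step)]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


# Defaults quoted in the text: 1,000 bp Query chunks, 10,000 bp Reference chunks.
genome = "ACGT" * 5000  # toy 20,000 bp sequence
query_chunks = chunk_sequence(genome, 1000)
reference_chunks = chunk_sequence(genome, 10000)
```

The last chunk is simply shorter when the genome length is not a multiple of the chunk size; how the real tool treats that tail is not stated here.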

When I first designed this tool, I used one FASTA file with 6 bacterial genomes as both the Query and Reference file. This meant that there were 6 FASTA headers in each file. I suggest using the same file for both the Query and Reference; however, this is by no means mandatory. (I have been able to analyze files containing up to 22 larger bacterial genomes as of 21 January 2025.)

Formatting: Each FASTA file should include a newline at the end. (My FASTA File Merging program adds this in automatically, or this can be added by hitting Enter in the text editor.) Also, using a letter or number as a prefix in each genome header can be helpful when interpreting trees later on. (e.g. >A - U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome)
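For readers who want to check a merged file before uploading it, here is a small sketch of reading a multi-genome FASTA text into (header, sequence) pairs. The `parse_fasta` helper is hypothetical and is not part of Sequora; it only shows how headers such as ">A - ..." separate one genome from the next.

```python
# Minimal sketch of a multi-record FASTA reader; parse_fasta is a hypothetical helper.
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:].strip(), []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


example = ">A - genome one\nACGT\nAC\n>B - genome two\nTTTT\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```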

Advanced: For advanced purposes, users have the option to upload a different file for the Query and Reference sequences, and have the option to toggle circularity handling off. These advanced options have not yet been thoroughly tested as of 21 January 2025.

Sequora Phylogenomics

Part 1-2 FASTA Processing and CSV Export Tool

Upload Original Query FASTA:

Upload Original Reference FASTA:

Handle Circularity (for concatenated reference): (On by default)

Download Processed FASTA and CSVs:

APA Citation

Sequora Part 3 - Genomic Chunk Alignment Tool

Part 3 is the Advanced DNA Alignment Tool. This is where the majority of the actual work happens.

To use this part of the pipeline, upload the respective Query and Reference files that were generated in Parts 1-2.

Hit "Run Alignment." (This is the longest part of the pipeline, and may take hours or days to complete, depending on your system and file size.) At this point, you can go for a walk, get a snack, listen to music, or read a book, while the friendly robot ⚙️ does your work for you.

If this part of the pipeline does not work perfectly the first time, don't stress. It may mean that fewer genomes should be compared at once, or that a more powerful computer is needed. And of course, you can email me at GregoryMuhs@gmail.com if you need help.

Then Download the resulting CSV file.

This section of the program uses k-mer matches from the Query sequence chunks to filter through the Reference k-mer chunks. Those that pass this filter are then aligned using a local alignment method. The final results are reported in the generated CSV grid.
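The filtering step can be pictured with the sketch below. It assumes, based only on the tool title "Non-Overlapping Query K-mers" and the "Min K-mer Match Percentage" setting, that the query chunk is cut into non-overlapping k-mers and that a reference chunk passes when enough of them appear in it. The function names and the exact rule are assumptions, not the program's source.

```python
# Minimal sketch of a k-mer pre-filter; names and the pass rule are assumptions.
def reference_kmers(seq, k):
    """All overlapping k-mers of a reference chunk, stored as a set for fast lookup."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def query_kmers(seq, k):
    """Non-overlapping k-mers of a query chunk (a short tail is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def passes_kmer_filter(query, ref_kmer_set, k, min_match_pct):
    """True if at least min_match_pct percent of query k-mers occur in the reference."""
    kmers = query_kmers(query, k)
    if not kmers:
        return False
    hits = sum(1 for kmer in kmers if kmer in ref_kmer_set)
    return 100.0 * hits / len(kmers) >= min_match_pct


ref_set = reference_kmers("TTACGTTTGGCA", 4)
print(passes_kmer_filter("ACGTGGCA", ref_set, 4, 50))
```

Only the pairs that pass a filter like this would go on to the slower alignment step, which is what keeps the overall run time manageable.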

This output file can then be used in Part 4.

Sequora Part 3 - Non-Overlapping Query K-mers (Queue Capping)

Scoring Parameters

Match Score:

Mismatch Penalty:

Gap Opening Penalty:

Gap Extension Penalty:

Worker Settings

Detected CPU Cores:

Number of Workers:

Query Batch Size:

K-mer Filter Settings

Enable K-mer Filter:

K-mer Length:

Min K-mer Match Percentage:

Memory Usage Threshold (%): 80%

File Input

Upload Query FASTA File:

Upload Reference FASTA File:
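The four scoring parameters listed above (match, mismatch, gap opening, gap extension) are the usual inputs of an affine-gap local alignment. The sketch below is a generic Smith-Waterman score with Gotoh-style affine gaps, not Sequora's implementation. The default values are placeholders chosen for illustration (a match score of 2 would be consistent with the maximum Score of 2000 quoted for a 1,000 bp chunk, but the tool's real defaults are not given here), and a gap of length n is assumed to cost gap_open + (n - 1) * gap_extend.

```python
# Minimal sketch of affine-gap local alignment scoring; parameter defaults are assumptions.
def local_alignment_score(a, b, match=2, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best local alignment score between strings a and b."""
    neg_inf = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)        # best score ending at the previous row
    prev_f = [neg_inf] * (m + 1)  # previous row, alignments ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg_inf] * (m + 1)
        e = neg_inf               # current row, alignments ending in a horizontal gap
        for j in range(1, m + 1):
            e = max(cur_h[j - 1] + gap_open, e + gap_extend)
            cur_f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


print(local_alignment_score("ACGTACGT", "ACGTTACGT"))
```

Keeping only two rows in memory is a design choice that suits score-only use; recovering the aligned strings themselves would need a full traceback matrix.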

Part 4 - Query Highest Scores

In Part 4A, you upload the Alignment Results file that was generated in Part 3, and it will find the highest "Score" for each Query chunk against all of the Reference chunks from each original genome. So if I am comparing 6 bacterial genomes, Chunk 1 from Bacterium A will return the highest score from Bacterium A (i.e. 2000 or 100%), the highest from Bacterium B, Bacterium C, Bacterium D, Bacterium E, and Bacterium F.

Part 4B incorporates the counts of the number of chunks generated in Part 1-2. These counts are important because they allow the program and the user to determine how many of these sequences have a homologous counterpart, and how many are unique. In other words, did this reference chunk have a counterpart that passed the k-mer filter in Part 3 or not? This of course raises the philosophical question of where we draw the line in how homology is defined.

Part 4 also incorporates a High Score Count of the number of times the query chunk had a hit against the reference chunks. A High Score Count of 2 does not always indicate that there are two copies of this hit in the reference genome. Since overlapping chunks are generated in Part 1, a High Score Count of 2 usually just indicates that the query chunk found its homologous counterpart in two overlapping chunks during the Part 3 process.
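One way to picture the Part 4 bookkeeping is the sketch below. It groups alignment rows by query chunk and reference genome, keeps the highest score, and counts how many rows reach that score. The field names ("query_chunk", "reference_genome", "score") are hypothetical, and reading the High Score Count as "rows tied at the top score" is only one possible interpretation of the description above.

```python
# Minimal sketch; field names and the tie-counting rule are assumptions.
def highest_scores(rows):
    """Map (query_chunk, reference_genome) -> (highest score, number of rows with that score)."""
    best = {}
    for row in rows:
        key = (row["query_chunk"], row["reference_genome"])
        score = row["score"]
        if key not in best or score > best[key][0]:
            best[key] = (score, 1)
        elif score == best[key][0]:
            best[key] = (score, best[key][1] + 1)
    return best


rows = [
    {"query_chunk": "A_chunk1", "reference_genome": "A", "score": 2000},
    {"query_chunk": "A_chunk1", "reference_genome": "A", "score": 2000},
    {"query_chunk": "A_chunk1", "reference_genome": "B", "score": 1500},
    {"query_chunk": "A_chunk1", "reference_genome": "B", "score": 900},
]
for key, value in highest_scores(rows).items():
    print(key, value)
```

In this toy input the chunk hits two overlapping reference chunks of its own genome with the full score, which is the situation the text describes for a count of 2.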

CSV Header Cleanup Processor - Part 4A (Fully Quoted Output)

Bioinformatics Pipeline - Part 4B (Incremental)

Upload Query Chunk Count File (From Part 1): Upload Processed Data File (From Part 4A):

Part 5 - Genome Scoring Averages

In Part 5 the user uploads the Filtered CSV which was generated in Part 4B. Part 5 will then produce two CSVs, with download options, summarizing the average scores between each Query and Reference genome.

Part 5 of the program finds the average values for the comparisons between each pair of genomes and calculates custom values based on these averages.

Part 5 also generates distance values which are later used in Part 6 to generate distance matrices. The pairwise comparisons between genomes may or may not be identical in their scores. For example, Bacterium A may be 80% similar to Bacterium B, while Bacterium B may be only 70% similar to Bacterium A. This is partly due to size differences between the genomes. Bacterium B might have long (or short) stretches of its genome that are not found in Bacterium A. (Part 6 will later take an average of these two values to generate a symmetric distance matrix.)
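The averaging step can be sketched as follows, using the 80% and 70% example above. The conversion "distance = 100 - percent similarity" is an assumption made only for this illustration, since the exact distance formula is not spelled out here, and `symmetric_distances` is a hypothetical name.

```python
# Minimal sketch, assuming distance = 100 - similarity (an illustrative choice).
def symmetric_distances(similarity):
    """Average the A->B and B->A distances so the resulting matrix is symmetric."""
    genomes = sorted(similarity)
    dist = {}
    for a in genomes:
        for b in genomes:
            d_ab = 100.0 - similarity[a][b]
            d_ba = 100.0 - similarity[b][a]
            dist[(a, b)] = (d_ab + d_ba) / 2.0
    return dist


similarity = {
    "A": {"A": 100.0, "B": 80.0},  # Bacterium A compared against B: 80%
    "B": {"A": 70.0, "B": 100.0},  # Bacterium B compared against A: 70%
}
dist = symmetric_distances(similarity)
print(dist[("A", "B")], dist[("B", "A")])
```

Both directions end up with the same value, which is the property a distance-based tree method needs.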

Two important values to take note of here are the Combined Metrics 1 and 2. (I am still working on a better name for these metrics and am open to suggestions.)

Combined Metric 1 is calculated by Percent Identity With Gaps * Score * Aligned Frequency / 2000 * 100

(2000 is the highest possible Score, so this makes the results more consistent with other columns.)

Combined Metric 2 is calculated by Percent Identity With Gaps * Aligned Frequency * 100.
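The two formulas can be written out directly. This sketch assumes that Percent Identity With Gaps and Aligned Frequency are expressed as fractions between 0 and 1, so that both metrics land on a 0 to 100 scale; if the tool stores them as percentages, the scaling would differ. The function names are hypothetical.

```python
# Minimal sketch of the two formulas; input scaling (0-1 fractions) is an assumption.
MAX_SCORE = 2000  # highest possible Score, as stated in the text


def combined_metric_1(pct_identity_with_gaps, score, aligned_frequency):
    """Percent Identity With Gaps * Score * Aligned Frequency / 2000 * 100."""
    return pct_identity_with_gaps * score * aligned_frequency / MAX_SCORE * 100


def combined_metric_2(pct_identity_with_gaps, aligned_frequency):
    """Percent Identity With Gaps * Aligned Frequency * 100."""
    return pct_identity_with_gaps * aligned_frequency * 100


print(combined_metric_1(0.9, 1800, 0.5))
print(combined_metric_2(0.9, 0.5))
```

A genome pair where only half of the chunks align is pulled down by the Aligned Frequency term in both metrics, which is how regions without a counterpart still influence the result.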

Traditional phylogenetics tends to look at the regions that are the most similar while setting aside those regions that are the most different. Yet a major goal of science is to take into account all relevant data. This is why I wanted to create these combined metrics. By using these values, both regions that are very similar and those that are very different are included in the calculations.

How this approach will pan out, only time will tell. I look forward to hearing feedback from users.

Bioinformatics Pipeline - CSV Processor - Part5

Processed Data Preview (keyed by "Query Label | Reference Genome")

Averages Data Preview (grouped by "Query Genome | Reference Genome")

CSV Merger - Optional

If multiple alignments were performed (A vs A, B vs B, A vs B, and B vs A) as separate passes through this pipeline (especially through Part 3 of the pipeline for memory purposes) then this might be a good time to combine the Part 5 outputs, before moving on to Part 6 and Part 7. To open a separate page to the optional CSV merger, click here.

Part 6-7 - Generation of Phylogenomics Trees And Phylogenomics Webs

Part 6 and Part 7 have now been combined into one step. This step generates both the phylogenomics trees and phylogenomics webs, as described below!

Part 6 (as the header indicates) generates the phylogenomic trees, based on the "Averages" output file from Part 5. This is very straightforward. You upload the "Averages" file, click the button, and a bunch of trees (a whole forest of trees) is generated.

(Since two files are generated in Part 5, make sure the Averages file is the one being uploaded here.)

The first sets of trees are generated using the distances from Combined Metric 1 and Combined Metric 2.

Combined Metric 1 is calculated by Percent Identity With Gaps * Score * Aligned Frequency / 2000 * 100

(2000 is the highest possible Score, so this makes the results more consistent with other columns.)

Combined Metric 2 is calculated by Percent Identity With Gaps * Aligned Frequency * 100.

(See explanation in Part 5 for more details.)

Percent Identity with and without gaps are included next since these are industry-standard metrics, followed by various other metrics which may each be useful for their own purposes.

If there is a distance of zero between two leaves, they are combined into one.
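Collapsing zero-distance leaves could look something like the sketch below. It is only an illustration: the function name is hypothetical, and joining the merged labels with " + " is an assumed convention, not necessarily what the program prints.

```python
# Minimal sketch; merge_zero_distance_leaves and the " + " label style are assumptions.
def merge_zero_distance_leaves(labels, dist):
    """Group leaves whose distance to a group's first member is zero, then shrink the matrix."""
    groups = []
    for i in range(len(labels)):
        for group in groups:
            if dist[group[0]][i] == 0:
                group.append(i)
                break
        else:
            groups.append([i])
    merged_labels = [" + ".join(labels[i] for i in group) for group in groups]
    reps = [group[0] for group in groups]  # one representative row/column per group
    merged = [[dist[a][b] for b in reps] for a in reps]
    return merged_labels, merged


labels = ["A", "B", "C"]
dist = [
    [0, 0, 5],
    [0, 0, 5],
    [5, 5, 0],
]
print(merge_zero_distance_leaves(labels, dist))
```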

Sequora 3.09 - Enhancement 1 (19 March 2025) - The program now uses a combination of BioNJ, Non-Negative Least Squares (NNLS), and Partial Maximum Likelihood to compute the branch lengths between taxa. BioNJ generates the overall Newick tree, while the two supporting methods each refine the tree. While no tool is (or can be) perfect, users are encouraged to try different approaches and different methods while analyzing their genomes and generating these trees.

Sequora 3.06 - Enhancement 3 (06 February 2025) - Users can now modify the "Averages" output file from Part 5 by adding additional columns after "Distance Combined Metric 2" in Excel.

Technical Note 1: While designing this program, I kept finding weird placements for Pseudomonas species. In some phylogenomic trees, they were closely allied, while in others they were not. By conferring with ChatGPT and performing a quick literature search (via Google), I found that Pseudomonas genomes are highly variable (https://doi.org/10.1038/s41598-020-69944-6). This raises a lot of questions about our approach to genomic change and adaptation and shows that a strictly "modern synthesis" view represents a very incomplete picture of life history.

Technical Note 2: As of 17 January 2025, I am still working on refining how these trees are generated. I use a modified version of the neighbor-joining method to generate these trees based on the Part 5 Averages chart. This presents a challenge: some branch lengths are generated with a label that reads as a negative number, zero, or "NaN" (Not A Number). This is a known problem in using the neighbor-joining method. Part of this seems to be related to very large (or very small) numbers that are generated during the distance and branch length calculations. When I double-check the relative positions of the taxa placements by hand, compared to the Part 5 Averages chart, they seem to check out. All of that said, a lot of my current approach is experimental. We are trying something new with this program (myself by designing this approach, and you by implementing it on your data sets). Science is an iterative endeavor, not a perfect practice, and I encourage user feedback to refine this process as I go. (See note on NaNs, Zeros, and Negative Branch Lengths above.)

Update 06 February 2025: The tree generation method has been enhanced between Sequora 3.05 and Sequora 3.06 (see Sequora 3.06 - Enhancements). That said, a lot of work is still being done to refine how branch lengths are calculated. Since Sequora is a very nontraditional approach to phylogenomics, some of the standard "transform" methods do not entirely apply. Users are encouraged to try their own methods in Excel, if they like, and let me know what works well for them, and how this system can be improved.
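To see where a negative branch length can come from, the sketch below performs a single joining step of the classic neighbor-joining method (not BioNJ, and not the modified version used in Sequora). It picks the pair with the smallest Q value and computes the two branch lengths to the new internal node. The function name is hypothetical.

```python
# Minimal sketch of one classic neighbor-joining step; not Sequora's modified method.
def nj_first_join(labels, dist):
    """Return (label_i, label_j, branch_i, branch_j) for the first pair NJ would join."""
    n = len(labels)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    branch_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return labels[i], labels[j], branch_i, branch_j


# Distances that violate the triangle inequality (B to C is far longer than B-A plus A-C).
labels = ["A", "B", "C"]
dist = [
    [0, 1, 1],
    [1, 0, 10],
    [1, 10, 0],
]
print(nj_first_join(labels, dist))
```

With distances like these, which do not fit any tree exactly, the formula assigns leaf A a branch length of -4.0. Averaged similarity scores are rarely perfectly tree-like, which is one reason negative values can show up.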

Part 7 - Phylogenomics Webs

(Still very experimental)

This pipeline feels incomplete without a Part 7. So, here I am introducing, for the first time in history (as far as I can tell from background research), the world's first published Phylogenomics Web program. In addition to the standard "tree" diagram, this diagram shows distances between each pair of "leaves" on the tree, in the form of "phylogenomics lines" (perhaps a better term can be coined). This approach allows users to see the standard, calculated tree, while also seeing that there are other possibilities between the relationships of these organisms.
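The data behind such a web can be pictured as one line per pair of leaves, each carrying the pairwise distance. The sketch below only lists those lines, shortest first; it does not draw anything, and the function name and output layout are hypothetical.

```python
# Minimal sketch; web_lines is a hypothetical helper, not the program's code.
from itertools import combinations


def web_lines(labels, dist):
    """Return (leaf_a, leaf_b, distance) for every pair of leaves, shortest first."""
    lines = [
        (labels[i], labels[j], dist[i][j])
        for i, j in combinations(range(len(labels)), 2)
    ]
    return sorted(lines, key=lambda line: line[2])


labels = ["A", "B", "C"]
dist = [
    [0, 4, 9],
    [4, 0, 7],
    [9, 7, 0],
]
for line in web_lines(labels, dist):
    print(line)
```

For n leaves there are n * (n - 1) / 2 such lines, so a web for many genomes gets dense quickly.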

Trees are now generated based on each of nine set metrics, plus any metric that the user generated after the "Distance_Combined Metric 2" column in the "Averages_Data_Part5" CSV file.

In the future, I would love to see phylogenomics trees that take into account xenologs, which are typically seen as being the result of horizontal gene transfer (but perhaps could be the result of other mechanisms, such as convergence). Detecting xenologs was not the intent of the phylogenomics web at this stage; however, some of the lines generated from this program would seem to imply unexpected relationships between organisms being analyzed. Whether these are in fact xenologs or some other similar phenomenon, it is too early to tell... But that is why we do science, not because we know everything or that we have all the answers, but because we don't know, and we have questions to ask of the universe all around us.

Technical Note 3: As of 06 February 2025 the "leaf table" is not included in the button labelled "Download Outputs for This Metric"; however, I am working to fix this. For now, copy/pasting the leaf table for each metric into a CSV will be the best option.

If we knew what it was we were doing, it would not be called research, would it?
~Albert Einstein

Phylogenomic Trees – Part 6-7 (Distance_Combined + NNLS) – Abs Value Fix + Midpoint Rooting

Unleash Your Genetic Discovery Potential

Explore our HTML-based bioinformatics tools to advance your research and unravel the mysteries within your genetic data today.
