InteinFinder Output

Let's talk about the pipeline's output.

Here is the example output from one of InteinFinder's tests.

By default, the intermediate alignment files are removed, but they were kept for this example so you can see them:

$ tree if_out
if_out
|-- _done
|-- alignments
|   |-- 1_name_map.tsv
|   |-- mafft_out___seq_10___green_2018___seq_11___1___4.fa
|   |-- mafft_out___seq_10___inbase___seq_236___1___2.fa
|   |-- mafft_out___seq_10___kelley_2016___seq_1___1___3.fa
|   |-- mafft_out___seq_10___kelley_2016___seq_9___1___1.fa
|   |-- mafft_out___seq_11___green_2018___seq_11___1___4.fa
|   |-- mafft_out___seq_11___inbase___seq_236___1___2.fa
|   |-- mafft_out___seq_11___kelley_2016___seq_1___1___3.fa
|   |-- mafft_out___seq_11___kelley_2016___seq_9___1___1.fa
|   |-- mafft_out___seq_2___inbase___seq_440___2___1.fa
|   |-- mafft_out___seq_3___inbase___seq_219___1___1.fa
|   |-- mafft_out___seq_4___inbase___seq_440___2___1.fa
|   |-- mafft_out___seq_5___inbase___seq_524___1___1.fa
|   |-- mafft_out___seq_8___inbase___seq_524___1___1.fa
|   `-- mafft_out___seq_9___inbase___seq_524___1___1.fa
|-- logs
|   |-- 1_config.toml
|   |-- 2_pipeline_info.txt
|   `-- if_log.DATE.mmseqs_search.txt
|-- results
|   |-- 1_putative_intein_regions.tsv
|   |-- 2_intein_hit_checks.tsv
|   `-- 3_trimmed_inteins.faa
`-- search
    |-- cdm_db
    |   |-- 1_cdm_db_search_out.tsv
    |   `-- 2_cdm_db_search_summary.tsv
    `-- intein_db
        |-- 1_intein_db_search_out.tsv
        |-- 2_intein_db_search_with_regions.tsv
        `-- 3_intein_db_search_summary.tsv

_done

First, you see a file called _done. It is an empty file, just there so you can see at a glance if the pipeline completed successfully. This can also be useful for tracking progress of many InteinFinder runs when using a work scheduler on a shared compute cluster.
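
When many runs share a parent directory, the _done marker makes completion easy to check from a script. The following is a minimal Python sketch, not part of InteinFinder itself. The `runs` parent directory and the `finished_runs` helper are hypothetical names, and the sketch assumes each run's output directory sits directly under that parent.

```python
from pathlib import Path


def finished_runs(parent):
    """Split run directories under `parent` into (done, not_done) name lists.

    A run counts as done when its output directory contains a `_done` file.
    """
    done, not_done = [], []
    for out_dir in sorted(p for p in Path(parent).iterdir() if p.is_dir()):
        if (out_dir / "_done").is_file():
            done.append(out_dir.name)
        else:
            not_done.append(out_dir.name)
    return done, not_done


if __name__ == "__main__":
    # `runs` is a hypothetical directory holding one InteinFinder out_dir per run.
    if Path("runs").is_dir():
        done, not_done = finished_runs("runs")
        print(f"{len(done)} finished, {len(not_done)} unfinished or failed")
        for name in not_done:
            print("  not done:", name)
```

A run listed as not done may still be in progress, so pair this with your scheduler's job status before assuming a failure.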

Alignments

If you set remove_aln_files = false in the config file, you will see a directory called alignments. Within, there are all the alignment files generated by the pipeline, as well as a file, 1_name_map.tsv, which shows the mapping from internal sequence names used by the pipeline to the original names of the sequences.

Note that the other output files use the original sequence IDs, whereas the files in the alignments directory use the internal names. Use the name map file to translate between the two.

The alignment file names contain the internal query ID, along with the InteinFinder DB sequence ID with which it was aligned.
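
As a rough illustration of pulling the IDs back out of a file name, here is a small Python sketch. The layout it assumes (parts separated by `___`, a DB sequence ID that may itself contain `___`, and two trailing numbers) is inferred from the example listing above rather than from a documented specification, and `parse_alignment_name` is a hypothetical helper.

```python
from pathlib import Path


def parse_alignment_name(file_name):
    """Split an alignment file name into its `___`-separated parts.

    Assumed layout (inferred from the example listing, not a documented spec):
    mafft_out___<internal query ID>___<DB sequence ID>___<number>___<number>.fa
    where the DB sequence ID may itself contain `___`.
    """
    parts = Path(file_name).stem.split("___")
    if len(parts) < 5 or parts[0] != "mafft_out":
        raise ValueError(f"unexpected alignment file name: {file_name}")
    return {
        "query": parts[1],
        "target": "___".join(parts[2:-2]),
        # The meaning of the two trailing numbers is not described here,
        # so they are returned as-is.
        "trailing": (parts[-2], parts[-1]),
    }


print(parse_alignment_name("mafft_out___seq_10___green_2018___seq_11___1___4.fa"))
```

The `query` value is an internal name, so look it up in 1_name_map.tsv to recover the original sequence ID.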

Logs

The logs can be found in the logs directory.

1_config.toml:
- This is the original config file, copied into the output directory.
- It is here purely for convenience. I have found that it is easy for output directories and log files to get misplaced years after running a pipeline, so this should help you stay organized.
2_pipeline_info.txt:
- Contains versions of all software used in the pipeline
- Shows the config options as interpreted by InteinFinder
  - Since you don't have to supply optional values in the config file, you can use this file to see exactly what config options the pipeline has been run with.
if_log.DATE.*:
- Next you will see one or more files that have this format.
- If the pipeline succeeds, you will probably only see one for MMseqs2, as it dumps a lot of info even when it is successful.
- If the pipeline fails, you should see at least a few more files in here, containing information about what went wrong.

Pipeline info

Here is an example of how the 2_pipeline_info.txt file might look:

Program Versions
================
InteinFinder version: 1.0.0-SNAPSHOT [3da10c5].
/usr/bin/mafft version: v7.490 (2021/Oct/30).
/home/ryan/software/mmseqs/bin/mmseqs version: 45111b641859ed0ddd875b94d6fd1aef1a675b7e.
/usr/bin/rpsblast+ version: rpsblast+: 2.12.0+.  Package: blast 2.12.0, build Mar  8 2022 16:19:08.
/usr/bin/makeprofiledb version: makeprofiledb: 2.12.0+.  Package: blast 2.12.0, build Mar  8 2022 16:19:08.

Working Directory
=================
/home/ryan/projects/InteinFinder/_examples/basic_usage

Config
======
((inteins_file ../../_assets/intein_sequences/all_derep.faa)
 (queries_file queries.faa) (smp_dir ../../_assets/smp) (out_dir if_out)
 (checks
  ((start_residue
    ((A 1) (C 1) (F 2) (G 2) (L 2) (M 2) (N 2) (P 1) (Q 1) (S 1) (T 1) (V 2)))
   (end_residues
    ((AN 2) (CN 2) (DY 2) (FN 1) (GN 1) (GQ 1) (HN 1) (HQ 2) (KN 2) (KQ 2)
     (LD 1) (LH 2) (NS 2) (NT 2) (PP 2) (PY 2) (RD 2) (SD 2) (SN 1) (SQ 2)
     (TH 2) (VH 2) (YN 2)))
   (end_plus_one_residue ((C 1) (S 1) (T 1)))))
 (mafft ((exe /usr/bin/mafft)))
 (makeprofiledb ((exe /usr/bin/makeprofiledb)))
 (mmseqs
  ((exe /home/ryan/software/mmseqs/bin/mmseqs) (evalue 0.001)
   (num_iterations 2) (sensitivity 5.7)))
 (rpsblast ((exe /usr/bin/rpsblast+) (evalue 0.001))) (log_level info)
 (clip_region_padding 10) (min_query_length 100) (min_region_length 100)
 (remove_aln_files true) (threads 1))

There are three sections:

Program Versions:
- Contains info about the InteinFinder executable as well as versions of all the programs used in the pipeline.
Working Directory:
- The working directory from which the InteinFinder executable was run.
- This is useful if you want to rerun the pipeline using the same config file, as you are allowed to use relative file paths in the config files.
Config:
- All the options as interpreted from the config file.
- There will likely be more config parameters listed here than you included in your config file.
  - Config files do not require you to specify optional arguments.
  - But the optional arguments do show up in this data structure.
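
The Config section in the example is printed as a parenthesized, S-expression-like structure, so it can be read back programmatically. The Python sketch below is one possible way to do that under simplifying assumptions: atoms never contain spaces or quotes, and the text of the Config section has already been extracted from the file. `parse_sexp` and `lookup` are hypothetical helpers, and the config text is an abbreviated copy of the example above.

```python
import re


def parse_sexp(text):
    """Parse a parenthesized S-expression into nested Python lists of strings.

    Simplification: atoms are runs of non-space, non-parenthesis characters,
    so quoted atoms containing spaces are not handled.
    """
    tokens = re.findall(r"[()]|[^\s()]+", text)
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0][0]


def lookup(fields, *keys):
    """Follow a chain of keys through (key value) pairs."""
    for key in keys:
        fields = next(f[1] for f in fields if f[0] == key)
    return fields


# Abbreviated copy of the Config section from the example above.
config_text = """
((queries_file queries.faa) (out_dir if_out)
 (mmseqs ((exe /home/ryan/software/mmseqs/bin/mmseqs) (evalue 0.001)
  (num_iterations 2) (sensitivity 5.7)))
 (remove_aln_files true) (threads 1))
"""
config = parse_sexp(config_text)
print(lookup(config, "remove_aln_files"))  # true
print(lookup(config, "mmseqs", "evalue"))  # 0.001
```

All values come back as strings, so convert them (for example with `float` or `int`) before comparing numbers.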

Results

The results directory contains the main "results" of the pipeline.

Putative intein regions

The 1_putative_intein_regions.tsv file contains coordinates of any putative intein regions for your queries.

These regions are defined solely through homology to an entry in any of InteinFinder's databases (either the intein sequence database or the conserved domain database).

While these regions often contain actual inteins, they may instead be mobile elements, endonucleases, or similar sequences, since many non-intein proteins also have hits to entries in these databases. That is why InteinFinder performs additional steps to check for inteins.

query: the query sequence ID
region_index: 1-indexed, region ID (queries may have multiple hit regions)
start: 1-indexed, inclusive start coordinate
end: 1-indexed, inclusive end coordinate
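
Because both coordinates are 1-indexed and inclusive, region lengths need a `+ 1`. The Python sketch below shows the arithmetic on made-up rows. It assumes the file is tab-separated with a header row naming the four columns above, and `region_lengths` is a hypothetical helper.

```python
import csv
import io


def region_lengths(tsv_handle):
    """Return {(query, region_index): length} from a putative-regions table.

    Assumes a tab-separated table with a header row naming the columns
    query, region_index, start, end.
    """
    lengths = {}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        start, end = int(row["start"]), int(row["end"])
        # Both coordinates are 1-indexed and inclusive, so add 1.
        lengths[(row["query"], int(row["region_index"]))] = end - start + 1
    return lengths


# Made-up rows for illustration only; real files come from results/.
example = io.StringIO(
    "query\tregion_index\tstart\tend\n"
    "my_query\t1\t101\t450\n"
    "my_query\t2\t600\t899\n"
)
print(region_lengths(example))
# {('my_query', 1): 350, ('my_query', 2): 300}
```

To slice a Python string with these coordinates, use `seq[start - 1:end]`.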

Intein sequence hits

Putative intein regions can be defined by hits to either intein sequences or conserved domains. Because hits to intein sequences are of particular importance, there is an output file, search/intein_db/2_intein_db_search_with_regions.tsv, that gives info about the queries that had hits specifically to inteins from InteinFinder's databases, and the region in which those hits occurred.

This file is similar to BLAST tabular output (format 6), with some added info at the beginning and the end.

The first few columns describe the query, the intein DB hit, and the region in which that hit occurred.

query
region
region_start
region_end
target

The following fields are the same as in a BLAST tabular output (format 6) file.

pident
alnlen
mismatch
gapopen
qstart
qend
tstart
tend
evalue
bits

Finally, two additional fields give the length of the query and target sequences.

qlen
tlen
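
With the column order above, the file can be loaded by position. The Python sketch below keeps the highest-bit-score hit for each query and region. It is only an illustration: `best_hit_per_region` is a hypothetical helper, the rows and target names are made up, and whether the real file starts with a header row is an assumption handled by skipping any row whose first field is `query`.

```python
import csv
import io

COLUMNS = [
    "query", "region", "region_start", "region_end", "target",
    "pident", "alnlen", "mismatch", "gapopen", "qstart", "qend",
    "tstart", "tend", "evalue", "bits", "qlen", "tlen",
]


def best_hit_per_region(tsv_handle):
    """Keep the highest-bit-score hit for each (query, region) pair."""
    best = {}
    for row in csv.reader(tsv_handle, delimiter="\t"):
        # Skip blank lines and a header row, if one is present.
        if not row or row[0] == "query":
            continue
        hit = dict(zip(COLUMNS, row))
        key = (hit["query"], hit["region"])
        if key not in best or float(hit["bits"]) > float(best[key]["bits"]):
            best[key] = hit
    return best


# Made-up rows for illustration only.
example = io.StringIO(
    "q1\t1\t10\t400\tdb_seq_A\t45.0\t300\t150\t5\t12\t398\t1\t310\t1e-30\t120\t500\t320\n"
    "q1\t1\t10\t400\tdb_seq_B\t60.0\t305\t110\t3\t10\t400\t1\t312\t1e-50\t200\t500\t315\n"
)
best = best_hit_per_region(example)
print(best[("q1", "1")]["target"])  # db_seq_B
```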

Intein hit checks

Next, there is a file describing all the "checks" that InteinFinder does on the putative regions: 2_intein_hit_checks.tsv.

This file has many columns and contains nearly all of the important information that the pipeline produces.

Column name	Column index (1-based)
query	1
region	2
intein target	3
intein start minus one	4
intein start	5
intein penultimate	6
intein end	7
intein end plus one	8
intein length	9
start residue check	10
end residues check	11
end plus one residue check	12
start position check	13
end position check	14
region check	15
overall check	16
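
Since the table above gives each column's position, one way to work with this file is to select columns by position rather than by header text. The Python sketch below does that. It assumes a tab-separated file whose first row is a header (controlled by the hypothetical `skip_header` parameter), and it makes no assumption about the values stored in the check columns. `column_index` and `select_columns` are hypothetical helpers.

```python
import csv
import io

# Column names in file order (position 1 is "query").
CHECK_COLUMNS = [
    "query", "region", "intein target", "intein start minus one",
    "intein start", "intein penultimate", "intein end",
    "intein end plus one", "intein length", "start residue check",
    "end residues check", "end plus one residue check",
    "start position check", "end position check", "region check",
    "overall check",
]


def column_index(name):
    """Return the 1-based column index for a column name."""
    return CHECK_COLUMNS.index(name) + 1


def select_columns(tsv_handle, names, skip_header=True):
    """Yield dicts with only the requested columns from each data row."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    if skip_header:
        next(reader, None)
    for row in reader:
        yield {name: row[column_index(name) - 1] for name in names}


print(column_index("intein length"))  # 9
print(column_index("overall check"))  # 16
```

For example, `select_columns(handle, ["query", "region", "overall check"])` yields one small dict per row with just those three fields.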