{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The alfie package - demo\n", "\n", "\n", "\n", "## Part 1. Evaluate sequences with alfie's pre-built kingdom level classifier\n", "\n", "Alfie's primary function is as a kingdom-level taxonomic classifier for COI-5P barcode data. To accomplish this, alfie uses a deep neural network to analyze a set of input sequences and predict taxonomy. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read in data \n", "\n", "Alfie contains a series of functions for reading and writing DNA sequence data in fasta or fastq format. These functions are found in the `seqio` module and their import is demonstrated below. Alfie also contains two example files that we import below using the read_fasta and read_fastq functions." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import functions for fasta/fastq input and output \n", "from alfie.seqio import read_fasta, read_fastq, write_fasta, write_fastq" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': '@seq1_plantae',\n", " 'sequence': 'ttctaggagcatgtatatctatgctaatccgaatggaattagctcaaccaggtaaccatttgcttttaggtaatcaccaagtatacaatgttttaattacagcacatgcttttttaatgattttttttatggtaatgcctgtaatgattggtggttttggtaattggttagttcctattatgataggaagtccagatatggcttttcctagactaaataacatatctttttgacttcttccaccttctttatgtttacttttagcttcttcaatggttgaagtaggtgttggaacaggatgaactgtttatcctccccttagttcgatacaaagtcattcaggcggagctgttgatttagcaatttttagcttacatttatctggagcttcatcgattttaggagctgtcaattttatttctacgattctaaatatgcgtaatcctgggcaaagcatgtatcgaatgccattatttgtttgatctatttttgtaacggca',\n", " 'strand': '+',\n", " 'quality': 
'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#import path to example file\n", "from alfie import ex_fastq_file\n", "\n", "#read in example file\n", "example_fastq = read_fastq(ex_fastq_file)\n", "\n", "#check format\n", "example_fastq[0]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'seq1_plantae',\n", " 'sequence': 'TTCTAGGAGCATGTATATCTATGCTAATCCGAATGGAATTAGCTCAACCAGGTAACCATTTGCTTTTAGGTAATCACCAAGTATACAATGTTTTAATTACAGCACATGCTTTTTTAATGATTTTTTTTATGGTAATGCCTGTAATGATTGGTGGTTTTGGTAATTGGTTAGTTCCTATTATGATAGGAAGTCCAGATATGGCTTTTCCTAGACTAAATAACATATCTTTTTGACTTCTTCCACCTTCTTTATGTTTACTTTTAGCTTCTTCAATGGTTGAAGTAGGTGTTGGAACAGGATGAACTGTTTATCCTCCCCTTAGTTCGATACAAAGTCATTCAGGCGGAGCTGTTGATTTAGCAATTTTTAGCTTACATTTATCTGGAGCTTCATCGATTTTAGGAGCTGTCAATTTTATTTCTACGATTCTAAATATGCGTAATCCTGGGCAAAGCATGTATCGAATGCCATTATTTGTTTGATCTATTTTTGTAACGGCA'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#repeat the above process with a fasta file\n", "from alfie import ex_fasta_file\n", "\n", "example_fasta = read_fasta(ex_fasta_file)\n", "\n", "example_fasta[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classify sequences\n", "\n", "Once we have imported the data with the seqio module, we can use the `classify_records` function to obtain taxonomic classifications for the input sequences. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# import classification function\n", "from alfie.classify import classify_records\n", "\n", "seq_records, predictions = classify_records(example_fasta)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classification process is completed above in just one line of code. In the background the `classify_records` function is taking the input sequences, generating a set of input features (k-mer frequencies) for each sequence, and passing the feature sets through a neural network to obtain a prediction. \n", "\n", "The function yields two outputs, a list of sequence records and an array of classifications. The sequence records are a list of dictionaries just like the inputs. Now there is an additional 'kmer_data' entry, which contains a data structure used to generate the input features for the neural network (more information on the `KmerFeatures` class is provided in part 2.)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'seq1_plantae',\n", " 'sequence': 'TTCTAGGAGCATGTATATCTATGCTAATCCGAATGGAATTAGCTCAACCAGGTAACCATTTGCTTTTAGGTAATCACCAAGTATACAATGTTTTAATTACAGCACATGCTTTTTTAATGATTTTTTTTATGGTAATGCCTGTAATGATTGGTGGTTTTGGTAATTGGTTAGTTCCTATTATGATAGGAAGTCCAGATATGGCTTTTCCTAGACTAAATAACATATCTTTTTGACTTCTTCCACCTTCTTTATGTTTACTTTTAGCTTCTTCAATGGTTGAAGTAGGTGTTGGAACAGGATGAACTGTTTATCCTCCCCTTAGTTCGATACAAAGTCATTCAGGCGGAGCTGTTGATTTAGCAATTTTTAGCTTACATTTATCTGGAGCTTCATCGATTTTAGGAGCTGTCAATTTTATTTCTACGATTCTAAATATGCGTAATCCTGGGCAAAGCATGTATCGAATGCCATTATTTGTTTGATCTATTTTTGTAACGGCA',\n", " 'kmer_data': }" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_records[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictions returned are an encoded array of kingdom classifications.\n", "Encodings are in alphabetical order: \n", "```\n", "0 == \"animalia\", 1 
== \"bacteria\", 2 == \"fungi\", 3 == \"plantae\", 4 == \"protista\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3, 1, 4, 0, 0])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some purposes, the kingdom names may be preferable to the numeric encodings. To obtain them, use the `decode_predictions` function to convert the array of numeric predictions into a list of names. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['plantae', 'bacteria', 'protista', 'animalia', 'animalia']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alfie.classify import decode_predictions\n", "\n", "predicted_kingdoms = decode_predictions(predictions)\n", "predicted_kingdoms[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Working within Python provides the freedom to manipulate the sequence records in various ways using this information. What you do with the classifications will depend on your research goal. You may wish to save a select category of sequences to a file, merge the classifications with the existing sequence IDs, or carry the classifications through to additional analyses in Python. Here we explore a few of these possibilities.\n", "\n", "\n", "**a.** Add the predictions to the sequence dictionaries before further manipulation." 
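, "

As a self-contained illustration of this step, the sketch below decodes numeric predictions with the alphabetical encoding listed above and attaches the resulting names to a few records. The `KINGDOMS` list, the records, and the predictions are stand-ins written for this note (assumptions, not alfie output), and the lookup only mimics what `decode_predictions` is expected to return; it is not alfie's implementation.

```python
# Alphabetical encoding described in this notebook: list index == numeric code.
KINGDOMS = ['animalia', 'bacteria', 'fungi', 'plantae', 'protista']

# Made-up records and predictions for illustration; not real alfie output.
records = [
    {'name': 'seq1', 'sequence': 'ACGT'},
    {'name': 'seq2', 'sequence': 'TTGA'},
    {'name': 'seq3', 'sequence': 'GGCA'},
]
encoded_predictions = [3, 1, 0]

# The two lists are parallel, so they must have the same length.
if len(records) != len(encoded_predictions):
    raise ValueError('expected one prediction per record')

# zip pairs each record with its prediction; the code indexes into KINGDOMS.
for record, code in zip(records, encoded_predictions):
    record['kingdom'] = KINGDOMS[code]

print([record['kingdom'] for record in records])
```

Pairing with `zip` avoids indexing by position, and the length check guards against records and predictions drifting out of step.
"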
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'seq1_plantae',\n", " 'sequence': 'TTCTAGGAGCATGTATATCTATGCTAATCCGAATGGAATTAGCTCAACCAGGTAACCATTTGCTTTTAGGTAATCACCAAGTATACAATGTTTTAATTACAGCACATGCTTTTTTAATGATTTTTTTTATGGTAATGCCTGTAATGATTGGTGGTTTTGGTAATTGGTTAGTTCCTATTATGATAGGAAGTCCAGATATGGCTTTTCCTAGACTAAATAACATATCTTTTTGACTTCTTCCACCTTCTTTATGTTTACTTTTAGCTTCTTCAATGGTTGAAGTAGGTGTTGGAACAGGATGAACTGTTTATCCTCCCCTTAGTTCGATACAAAGTCATTCAGGCGGAGCTGTTGATTTAGCAATTTTTAGCTTACATTTATCTGGAGCTTCATCGATTTTAGGAGCTGTCAATTTTATTTCTACGATTCTAAATATGCGTAATCCTGGGCAAAGCATGTATCGAATGCCATTATTTGTTTGATCTATTTTTGTAACGGCA',\n", " 'kmer_data': ,\n", " 'kingdom': 'plantae'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#iterate over the predictions, use enumerate to get record number\n", "for i, p in enumerate(predicted_kingdoms):\n", " #call corresponding sequence record and add a kingdom entry to dictionary\n", " seq_records[i]['kingdom'] = p\n", " \n", "#taxonomic classification now present in the sequence dict\n", "seq_records[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b.** Use the predictions to subset out only data from a kingdom of interest." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "animal_sequences = []\n", "\n", "for i, x in enumerate(predicted_kingdoms):\n", " if x == 'animalia':\n", " animal_sequences.append(seq_records[i])\n", " \n", "#Note: you could avoid the transition to string classifications and \n", "#subset using the numeric classifications in the `predictions` array \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we see that the resulting list `animal_sequences` contains only the kingdom 'animalia'. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'name': 'seq4_animalia',\n", " 'sequence': 'AATCCGGGATCATTAATTGGTGATGATCAAATTTATAATACCATTGTTACAGCTCATGCATTTATTATAATTTTTTTTATGGTTATACCAATTATAATCGGAGGATTTGGTAATTGATTAGTACCATTGATATTAGGGGCACCTGATATAGCTTTCCCACGAATAAATAATATAAGATTTTGATTACTACCCCCTTCTTTAATACTTCTAATTTCTAGTAGTATTGTAGAAAATGGAGCTGGAACTGGATGAACAGTTTACCCCCCTTTATCATCTAATATCGCCCATGGAGGAAGATCTGTTGACTTAGCTATTTTTTCATTACATTTAGCTGGTATTTCATCTATTTTAGGAGCTATTAATTTTATT',\n", " 'kmer_data': ,\n", " 'kingdom': 'animalia'},\n", " {'name': 'seq5_animalia',\n", " 'sequence': 'CAAATTTATAATACAATTGTTACAGCCCATGCTTTTATTATAATTTTCTTTATAGTAATGCCTATTATAATTGGAGGATTTGGAAATTGATTAGTACCTTTAATATTAGGAGCCCCCGATATAGCTTTCCCCCGAATAAATAATATAAGATTTTGACTTCTCCCCCCATCATTAACCCTTTTAATTTCAAGAAGAGTTGTAGAAAATGGTACTGGAACTGGATGAACAGTTTACCCCCCTTTATCATCTAATATTGCTCATAGAGGAAGATCTGTTGATTTATCTATTTTTTCCCTTCATTTAGCTGGAATTTCTTCTATTTTAGGAGCAATTAATTTTATTACAACTATTATTAATATACGATTAAATAATATAACATTTGATCAATTACCTTTATTTGTATGATCTGTTGGAATTACAGCTCTTCTTCTTCTTCTTTCTCTTCCTGTTTTAGCAGGAGCTATTACTATATTATTA',\n", " 'kmer_data': ,\n", " 'kingdom': 'animalia'}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "animal_sequences[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**c.** Writing to file\n", "\n", "Once processing and analyses are completed, we may wish to save our sequence records to a new output file. Sequences, or lists of sequences in dictionary format can be written to output files if they possess the proper set of keys ('name' and 'sequence' for fasta, 'name', 'sequence', 'strand', and 'quality'). Additional keys will be ignored when writing the output.\n", "\n", "In this demonstration, we take the animal sequences we isolated from the input (**b.**) and write them to a new fasta file." 
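, "

To make the role of the keys concrete, here is a minimal sketch of how dictionary records map onto FASTA text. `records_to_fasta_text` is a hypothetical helper written for this note, not alfie's `write_fasta`; it assumes the standard FASTA layout (a header line starting with `>` followed by the sequence) and uses shortened, made-up sequences.

```python
import io


def records_to_fasta_text(records):
    # Hypothetical helper: format records as FASTA text.
    # Only the 'name' and 'sequence' keys are read; any other key is ignored.
    buffer = io.StringIO()
    for record in records:
        print('>' + record['name'], file=buffer)
        print(record['sequence'], file=buffer)
    return buffer.getvalue()


# Shortened example records; the extra 'kingdom' key is not written out.
animal_records = [
    {'name': 'seq4_animalia', 'sequence': 'AATCCGGGATCA', 'kingdom': 'animalia'},
    {'name': 'seq5_animalia', 'sequence': 'CAAATTTATAAT', 'kingdom': 'animalia'},
]

fasta_text = records_to_fasta_text(animal_records)
print(fasta_text, end='')
```

Building the text in memory first makes it easy to inspect what would be written before touching the file system.
"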
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# if you uncomment and run the following line, it will make an output file in your current working directory\n", "#write_fasta(animal_sequences, 'animalia_example_output.fasta')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is all there is to deploying the alfie package as a kingdom-level classifier from within Python. The kingdom-level classification provides an efficient means of separating the DNA sequences of a target kingdom from the large amount of off-target noise that can exist within a metabarcoding or environmental DNA data set.\n", "\n", "Next, we explore how the functionality of alfie can be customized to isolate target sequences on finer taxonomic scales." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Part 2. Train and test a custom, alignment-free taxonomic classifier\n", "\n", "In addition to using alfie as a kingdom-level classifier, alfie's helper functions can be used to train a custom DNA barcode classification model. Custom model construction allows the general functionality of alfie (as a kingdom-level classifier) to be extended and specialized. Common applications of this customization include training a classifier for a sub-group of interest (i.e. an intra-group classifier, which is demonstrated here for the phylum Annelida), or training a binary classifier to isolate barcodes from a specific taxonomic group (e.g. a classifier that says whether or not an input sequence comes from a teleost fish).\n", "\n", "The following demonstration shows how to train a custom binary or multiclass neural network through a combination of alfie, scikit-learn, and TensorFlow. 
Alfie can also be used in the training and deployment of other machine learning models, not just neural networks (any model that has a `predict` method is valid, which includes scikit-learn's classifiers). For a demonstration of non-neural-network model deployment, [see this supplementary script in the example folder](https://github.com/CNuge/alfie/blob/master/example/non_neural_net_model_example.py), which repeats the model training conducted below with a support vector machine in lieu of a neural network.\n", "\n", "Note that this process is a little more involved than the default implementation of alfie. This demo assumes the reader understands the basics of data science and machine learning in Python. If you're not yet comfortable with those topics, I would recommend the book [Hands-on Machine Learning with Scikit-Learn and TensorFlow](https://github.com/ageron/handson-ml2) as a good starting point.\n", "\n", "It is also important to note that your mileage with the design and deployment of a custom classifier will vary with the quality of the training data you use. A minimum of a few thousand labelled sequences is recommended. If you're looking to acquire COI training data, consider [the BOLD data systems website](http://www.boldsystems.org/index.php/Login/page?destination=MAS_Management_UserConsole) or [subsetting the data used in training the original alfie neural network](https://github.com/CNuge/data-alfie). \n", "\n", "If you want training data for other barcodes or genes, have a look at [NCBI](https://www.ncbi.nlm.nih.gov) (warning: data mining and cleaning required), or other online barcode data sources such as the [PLANiTS dataset](https://github.com/apallavicini/PLANiTS). Another good source of barcode data is [Dr. Teresita Porter's GitHub page](https://github.com/terrimporter), which contains trained RDP Classifiers (and labelled training data!) 
for rbcL, 18S, ITS, and other barcodes.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import tensorflow as tf\n", "\n", "from sklearn.model_selection import KFold\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.preprocessing import LabelBinarizer" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from alfie.kmerseq import KmerFeatures\n", "from alfie.training import stratified_taxon_split, sample_seq, process_sequences, alfie_dnn_default" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the demo data\n", "\n", "The demo data can be found in the [alfie GitHub repository](https://github.com/CNuge/alfie/tree/master/example). The relative path below assumes you have downloaded the alfie repository from GitHub and that your working directory is `alfie/example`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('alfie_small_train_example.tsv', sep='\\t')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity, this demo is conducted with 10,000 sequences from the phylum Annelida, which is represented in this data set by only two taxonomic classes. We will train a neural network to predict the class of Annelida sequences in an alignment-free fashion. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>processid</th><th>sequence</th><th>phylum</th><th>class</th><th>order</th><th>family</th><th>genus</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>GAHAP309-13</td><td>accttatactttattctgggcgtatgagcaggaatattgggtgcag...</td><td>Annelida</td><td>Clitellata</td><td>Enchytraeida</td><td>Enchytraeidae</td><td>Grania</td></tr>\n",
"<tr><th>1</th><td>GAHAP2002-14</td><td>accctatatttcattctcggagtttgagctggcatagtaggtgccg...</td><td>Annelida</td><td>Clitellata</td><td>Haplotaxida</td><td>Lumbricidae</td><td>Aporrectodea</td></tr>\n",
"<tr><th>2</th><td>GBAN15302-19</td><td>actctatacttaatttttggtatt-gagccggtatagtaggaacag...</td><td>Annelida</td><td>Clitellata</td><td>Haplotaxida</td><td>Naididae</td><td>Ainudrilus</td></tr>\n",
"<tr><th>3</th><td>GBAN11905-19</td><td>acactatattttattttaggaatttgagctggaataattggagcag...</td><td>Annelida</td><td>Clitellata</td><td>Crassiclitellata</td><td>Megascolecidae</td><td>Metaphire</td></tr>\n",
"<tr><th>4</th><td>GBAN15299-19</td><td>acattatacctaattta-ggtgtatgagccggaatagttggaacag...</td><td>Annelida</td><td>Clitellata</td><td>Haplotaxida</td><td>Naididae</td><td>Ainudrilus</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " processid sequence phylum \\\n", "0 GAHAP309-13 accttatactttattctgggcgtatgagcaggaatattgggtgcag... Annelida \n", "1 GAHAP2002-14 accctatatttcattctcggagtttgagctggcatagtaggtgccg... Annelida \n", "2 GBAN15302-19 actctatacttaatttttggtatt-gagccggtatagtaggaacag... Annelida \n", "3 GBAN11905-19 acactatattttattttaggaatttgagctggaataattggagcag... Annelida \n", "4 GBAN15299-19 acattatacctaattta-ggtgtatgagccggaatagttggaacag... Annelida \n", "\n", " class order family genus \n", "0 Clitellata Enchytraeida Enchytraeidae Grania \n", "1 Clitellata Haplotaxida Lumbricidae Aporrectodea \n", "2 Clitellata Haplotaxida Naididae Ainudrilus \n", "3 Clitellata Crassiclitellata Megascolecidae Metaphire \n", "4 Clitellata Haplotaxida Naididae Ainudrilus " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Clitellata 6187\n", "Polychaeta 3813\n", "Name: class, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['class'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conducting a train/test split\n", "\n", "First we set aside a test set from the input data. The alfie function `stratified_taxon_split` splits a dataframe in a stratified fashion based on the taxonomic labels in a column. This ensures that each taxonomic group is represented in the same proportion in the training and test data sets." 
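, "

To show what a stratified split does, here is a minimal sketch in plain Python. `stratified_split` is a hypothetical helper written for this note, not the implementation of `stratified_taxon_split`; it assumes the labels are available as a simple list, shuffles the row indices of each class separately, and sends the same fraction of each class to the test set.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=0):
    # Hypothetical helper: return (train_indices, test_indices) such that
    # every label contributes the same fraction of its rows to the test set.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels in a 60/40 ratio, loosely mirroring the two Annelida classes.
labels = ['Clitellata'] * 60 + ['Polychaeta'] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=1)
print(len(train_idx), len(test_idx))
```

Splitting within each class, rather than across the whole table, is what keeps the class ratio the same in both subsets.
"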
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Conducting train/test split, split evenly by: class\n" ] } ], "source": [ "train, test = stratified_taxon_split(data, class_col='class', test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can call some summary functions on the output dataframes to verify that the class proportions are preserved in both subsets." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train data shape: (7000, 7)\n", "Clitellata 4331\n", "Polychaeta 2669\n", "Name: class, dtype: int64\n", "\n", "\n", "test data shape: (3000, 7)\n", "Clitellata 1856\n", "Polychaeta 1144\n", "Name: class, dtype: int64\n" ] } ], "source": [ "print(\"train data shape:\", train.shape)\n", "print(train['class'].value_counts())\n", "print(\"\\n\")\n", "\n", "print(\"test data shape:\", test.shape)\n", "print(test['class'].value_counts())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding the predictor data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### k-mer features\n", "\n", "Alfie contains a custom Python class called `KmerFeatures`, which takes a DNA sequence and generates k-mer count and k-mer frequency data. This class is used in the background by the `classify_records` function to generate the features for neural network prediction. \n", "\n", "Here we will use it to generate the feature sets for model training from our data. The class takes an id and a DNA sequence; by default, it counts 4-mers." 
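, "

To make the idea of k-mer features concrete, here is a minimal sketch of k-mer counting in plain Python. `kmer_counts` and `kmer_frequencies` are hypothetical helpers written for this note, not the `KmerFeatures` implementation; the sketch assumes overlapping windows, upper-cases the input, and skips any window containing a character other than A, C, G, or T.

```python
from itertools import product


def kmer_counts(sequence, k=4):
    # Hypothetical helper: count overlapping k-mers in a DNA sequence.
    # Every possible k-mer starts at zero, in alphabetical order.
    counts = {''.join(letters): 0 for letters in product('ACGT', repeat=k)}
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Windows with gaps or ambiguous bases are not keys, so they are skipped.
        if kmer in counts:
            counts[kmer] += 1
    return counts


def kmer_frequencies(counts):
    # Divide each count by the total so that sequence length does not matter.
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}


counts = kmer_counts('acgtacgtn-acgt', k=4)
freqs = kmer_frequencies(counts)
print(len(counts), counts['ACGT'], freqs['ACGT'])
```

With k = 4 there are 4 * 4 * 4 * 4 = 256 keys, so every sequence maps to a feature vector of the same length regardless of how long the sequence is.
"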
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "#uncomment to look at the docs\n", "#?KmerFeatures" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1 = KmerFeatures(name=train['processid'][0], sequence=train['sequence'][0])\n", "x1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upon initialization, the KmerFeatures instance generates a k-mer count dictionary whose keys are all possible nucleotide strings of length k (256 keys for k = 4). The class then iterates through the input sequence and counts the occurrences of each k-mer. Any occurrence of a character other than A, T, G, or C causes the encompassing k-mer to be ignored.\n", "\n", "After initialization, the k-mer keys and values can be accessed with dictionary-style methods (`keys`, `values`, and `items`), which return lists that can be sliced." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['AAAA',\n", " 'AAAC',\n", " 'AAAG',\n", " 'AAAT',\n", " 'AACA',\n", " 'AACC',\n", " 'AACG',\n", " 'AACT',\n", " 'AAGA',\n", " 'AAGC']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1.keys()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Note/digression: later on, we rely on the k-mer keys being in a consistent, alphabetical order. Alfie sorts the dict items by key before returning them, so the order does not depend on dict behaviour (dicts preserve insertion order in CPython 3.6 and, as a language guarantee, from Python 3.7, so the sort is mostly a safeguard). 
This is a PSA that it is 2020 and time to update if you're on 3.5 or earlier!](https://twitter.com/raymondh/status/773978885092323328)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 3, 1, 2, 2, 3, 1, 4, 3, 1]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1.values()[:10]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('AAAA', 2),\n", " ('AAAC', 3),\n", " ('AAAG', 1),\n", " ('AAAT', 2),\n", " ('AACA', 2),\n", " ('AACC', 3),\n", " ('AACG', 1),\n", " ('AACT', 4),\n", " ('AAGA', 3),\n", " ('AAGC', 1)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1.items()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that the training data is not biased by the overall size of the sequences, it is best to train the models using the k-mer frequencies (count/total), which the KmerFeatures class also provides for us." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.0030722 , 0.00460829, 0.0015361 , 0.0030722 , 0.0030722 ,\n", " 0.00460829, 0.0015361 , 0.00614439, 0.00460829, 0.0015361 ])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1.freq_values()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For machine learning purposes, the KmerFeatures class also outputs the k-mer dictionary keys and value frequencies as numpy arrays." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['AAAA', 'AAAC', 'AAAG', 'AAAT', 'AACA'], dtype='