Expression Technologies Inc.
In molecular biology, protein yield means the recombinant protein expression level, or the quantity of protein produced in a defined volume of culture. The quantity is measured in grams, milligrams, or micrograms, and the defined volume is usually one liter. A yield in grams per liter is high and in the range of pharmaceutical production. A yield in milligrams per liter is intermediate and sufficient for most biochemical analyses. A yield in micrograms per liter is low and only enough for limited biochemical studies. Various expression technologies may be used to increase protein yield. The main control of protein yield appears to be at the transcription level, though other levels of regulation such as DNA replication and protein translation are also important. We will examine the factors related to protein yield individually.
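The three yield tiers described above can be sketched as a small helper. This is only an illustration: the exact cutoffs (1 g/L and 1 mg/L) and the function name `classify_yield` are assumptions chosen to match the "grams", "milligrams", and "micrograms per liter" wording, not thresholds stated in the text.

```python
def classify_yield(grams_per_liter: float) -> str:
    """Map a protein yield in g/L to the tiers described in the text.

    The cutoffs (1 g/L and 1 mg/L) are assumed for illustration.
    """
    if grams_per_liter < 0:
        raise ValueError("yield cannot be negative")
    if grams_per_liter >= 1.0:       # grams per liter
        return "high"
    if grams_per_liter >= 1e-3:      # milligrams per liter
        return "intermediate"
    return "low"                     # micrograms per liter or less


print(classify_yield(2.5))    # a 2.5 g/L culture
print(classify_yield(0.040))  # a 40 mg/L culture
print(classify_yield(5e-6))   # a 5 ug/L culture
```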
Contents of protein yield
Most recombinant proteins are expressed in heterologous hosts; that is, the proteins are expressed in cell lines or cell strains other than those in which they are produced natively. The recombinant proteins often originate from mammals such as human or mouse. The heterologous hosts are E.coli, yeast, insect, and mammalian cells. At production scale, the protein yields in these hosts are similar, in the range of grams per liter. At laboratory scale, the protein yields from E.coli or insect cells are often higher than those from yeast or mammalian cells. A recombinant protein may be expressed in all available hosts, or it may be expressed in only one.
Eukaryotic cells with different genetic backgrounds are called cell lines. Prokaryotic cells with different genetic backgrounds are often termed cell strains. We use the term cell strains here since most of the protein expression we discuss is in E.coli. In a chosen host, there are many cell strains available for protein expression. Protein yield can differ significantly between cell strains. Some cell strains supply rare tRNAs. Others promote disulfide bond formation. Still others reduce protein toxicity. The needs of a particular protein may be assessed from existing knowledge and experiments, and cell strains may be chosen accordingly.
Expression vector and protein yield
An expression vector must contain structural units that allow protein expression. These structural units include at least a promoter, a ribosome binding site (rbs), a start codon, a stop codon, and a terminator, all of which are required for recombinant protein expression in a host cell. In addition, the expression vector has to contain a selection marker and a replication origin for the propagation and selection of the vector in a host cell. All these structural units directly or indirectly determine the expression level of the recombinant protein, or the protein yield. It should be stressed that the target recombinant protein itself also determines its expression level.
Promoter strength determines the mRNA level of the recombinant protein. Under normal conditions, the stronger the promoter, the higher the protein yield that may be obtained. For most toxic proteins, a weaker promoter gives a higher protein yield. In these cases, less is more and more is less. Un-induced or leaky expression is presumably responsible for this observation. Promoters commonly used in E.coli expression are either from native E.coli genes or from bacteriophages.
Phage promoters usually allow transcription at high specificity and rate. Some E.coli promoters such as Ptrc and Ptac also permit high transcription rate.
The ribosome, the machinery of protein synthesis or translation, binds at the ribosome binding site (rbs), which is also termed the Shine-Dalgarno sequence. The consensus rbs sequence is UAAGGAGG. Some E.coli genes do not have the consensus rbs sequence, but they still allow efficient protein translation. It is reported that the secondary structure of the rbs is important for ribosome binding or translation initiation. The 5' end capping and secondary structure may also enhance mRNA stability and therefore increase protein yield. Optimal translation initiation may be obtained from the consensus rbs sequence; therefore protein yield may be increased accordingly. The rbs sequence is located upstream of the start codon AUG, which differs from the eukaryotic Kozak sequence. The Kozak sequence flanks the start codon and is recognized by the ribosome as the translation initiation site.
It is reported that, for a given recombinant protein, different rbs sequences may give different expression levels and therefore different protein yields.
The spacing between the rbs and the start codon AUG is important for efficient translation initiation and protein yield. The optimal spacing appears to be 7 ± 2 nucleotides. However, spacings as short as 4 nucleotides or as long as 14 nucleotides have been reported to work, with lower efficiency.
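A rough way to check the rbs-to-start-codon spacing in a construct is sketched below. It assumes the spacing is counted as the number of nucleotides between the last base of the consensus rbs and the A of the first AUG that follows it, and it reads the stated optimum as 7 plus or minus 2 nucleotides (a window of 5 to 9). The function names and the example sequence are hypothetical.

```python
def sd_spacing(mrna, rbs="TAAGGAGG"):
    """Return the number of nucleotides between the rbs and the first
    ATG/AUG downstream of it, or None if either element is missing.

    Assumption: the first ATG after the rbs is the start codon.
    """
    seq = mrna.upper().replace("U", "T")
    rbs = rbs.upper().replace("U", "T")
    start = seq.find(rbs)
    if start == -1:
        return None
    rbs_end = start + len(rbs)          # index just past the rbs
    atg = seq.find("ATG", rbs_end)      # first start codon candidate
    if atg == -1:
        return None
    return atg - rbs_end


def spacing_is_optimal(spacing, center=7, tolerance=2):
    """True when the spacing falls inside center +/- tolerance."""
    return spacing is not None and abs(spacing - center) <= tolerance


example = "GGUAAGGAGGACAGCUAAUGAAA"    # hypothetical 5' region
spacing = sd_spacing(example)
print(spacing, spacing_is_optimal(spacing))
```

A real construct may carry a non-consensus rbs, so an exact string match like this one will miss many functional sites; it is meant only to make the spacing rule concrete.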
About 80% and 15% of E.coli genes use AUG and GUG as start codons, respectively. Other codons may also be used as start codons, but at lower frequency. Most genes from bacterial viruses, or phages, use AUG as the start codon. AUG may give better translation initiation than other start codons.
All organisms use three stop codons: UAA, UAG, and UGA. E.coli cells use UAA at a much higher frequency than the other two codons. Eukaryotes do not exhibit this preference in stop codon usage. Multiple stop codons may increase translation termination efficiency. Downstream of the stop codons, the transcription terminator sequences are responsible for transcription termination. Efficient transcription termination minimizes the cellular energy drain and reduces the metabolic burden on the host. More importantly, the transcription terminator forms a secondary structure at the 3' end of the mRNA, which improves the stability of the mRNA and therefore increases the protein yield.
Most expression vectors contain multiple stop codons in all three reading frames and efficient transcription terminators. When expressing a eukaryotic gene in E.coli, changing the stop codon to TAA during cloning may increase translation termination efficiency and accuracy.
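The start and stop codon preferences discussed above lend themselves to a quick check before cloning. The sketch below is a minimal illustration; the function name `check_start_stop` and the wording of its notes are assumptions, and it only inspects the first and last codon of the coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_start_stop(cds):
    """Return a list of notes about the start and stop codons of a
    coding sequence (DNA or RNA letters). An empty list means the
    sequence starts with ATG and ends with TAA."""
    seq = cds.upper().replace("U", "T")
    notes = []
    if len(seq) % 3 != 0:
        notes.append("length is not a multiple of 3")
    if seq[:3] != "ATG":
        notes.append(f"start codon is {seq[:3]}, not ATG")
    stop = seq[-3:]
    if stop not in STOP_CODONS:
        notes.append(f"no stop codon at the 3' end (found {stop})")
    elif stop != "TAA":
        notes.append(f"stop codon is {stop}; TAA is preferred in E.coli")
    return notes


print(check_start_stop("ATGAAATGA"))
```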
The replication origin determines the copy number of the expression vector in a host cell. Many genes that are highly expressed in their native cells are present in multiple copies. Plasmid copy number ranges from a few copies to hundreds of copies per cell. High copy number expression vectors normally give high protein yield for non-toxic proteins. However, a high copy number also drains cellular energy and is a major metabolic burden for the host cells. In addition, a high copy number increases the toxicity of the recombinant protein. Therefore many expression vectors use low or intermediate copy number replication origins derived from pBR322 or pACYC plasmids. High copy number origins such as those from pUC plasmids are mostly used for cloning purposes.
The most common selection markers used on expression vectors are the ampicillin, chloramphenicol, kanamycin, and tetracycline resistance genes. The popularity of these selection marker genes is more or less in the order listed. The degree of toxicity to the host cells of these gene products, combined with their respective antibiotics, may contribute to the popularity of these selection marker genes. Expressing a recombinant protein in expression vectors with different selection markers clearly results in different protein yields even when all other conditions are the same. The mechanism behind these differences is not known.
The most commonly used regulatory gene in protein expression is the lacI repressor gene. The following over-simplified chemical equilibrium represents the binding between the lacI repressor (R) and its DNA binding site, the lac operator (O).
R + O ⇌ RO
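For a simple one-to-one binding equilibrium between repressor and operator, R + O ⇌ RO, the fraction of operator occupied by repressor can be estimated from the dissociation constant Kd = [R][O]/[RO]. The sketch below assumes the repressor is in excess, so that free repressor is approximately total repressor; the function name and the numerical values are hypothetical and are not measured lacI parameters.

```python
def fraction_operator_bound(free_repressor, kd):
    """Fraction of operator sites bound by repressor for R + O <=> RO.

    Both arguments must use the same concentration unit.
    Derived from Kd = [R][O]/[RO]  =>  [RO]/([O] + [RO]) = [R]/(Kd + [R]).
    """
    if free_repressor < 0 or kd <= 0:
        raise ValueError("concentrations must be non-negative and Kd positive")
    return free_repressor / (kd + free_repressor)


# Hypothetical numbers: repressor at 9x Kd leaves 10% of operators free.
bound = fraction_operator_bound(free_repressor=9.0, kd=1.0)
print(f"bound: {bound:.2f}, unrepressed: {1 - bound:.2f}")
```

Under this simplified picture, the unbound fraction is a proxy for leaky, pre-induction expression: more repressor relative to Kd leaves fewer operators free.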
In addition to lacI repressor gene, other regulatory genes may also be incorporated on expression vectors. All regulatory genes are important for DNA cloning and protein expression.
Targeting protein cDNA and protein yield
When a eukaryotic gene is cloned for expression in E.coli, only the DNA sequence from the start codon to the stop codon is needed. Sequences flanking the start and stop codons of the cDNA are often provided by the chosen expression vector. In most cases, the cDNA sequences are not modified before cloning into an expression vector. However, the cDNA sequences may affect protein expression. E.coli cells most often use AUG and UAA as start and stop codons, and using AUG and UAA as start and stop codons respectively is general practice in the cloning step. High GC content at the 5' end of the transcribed mRNA may lead to secondary structures. These secondary structures may reduce or stop protein translation. Minimizing the 5' GC content reduces secondary structure formation and may increase protein yield. This may be achieved by using AT-rich codons for the amino acids concerned. GC content in other parts of the cDNA seems to have less effect on protein yield.
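The GC content at the 5' end of a coding sequence is easy to compute. The sketch below is a minimal illustration; the 30-nucleotide window is an assumed default, since the text does not say how much of the 5' end matters, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def five_prime_gc(cds, window=30):
    """GC fraction of the first `window` nucleotides of a coding sequence.

    The window size is an assumption made for illustration.
    """
    return gc_fraction(cds[:window])


cds = "ATGGCCGCGGGCCCGGCGTCCGGCAGCGGC" + "AAA" * 20   # hypothetical sequence
print(round(five_prime_gc(cds), 2), round(gc_fraction(cds), 2))
```

Comparing the 5' window against the whole sequence gives a first hint of whether synonymous AT-rich codons near the start might be worth trying.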
All organisms use 20 amino acids, but 64 codons encode these 20 amino acids plus three stop signals. Only two amino acids, Met and Trp, are encoded by a single codon. All other amino acids are encoded by multiple codons. This raises the possibility that different codons for the same amino acid may exhibit different translation efficiency. This is indeed the case. Different organisms have different codon preferences. Consistent with these preferences, different amounts of tRNA are available for recognizing different codons. Some codons are highly used in mammalian cells but rarely used in bacteria. Bacterial cells may not have a sufficient amount of tRNA to handle the expression of a protein with multiple rare codons, especially at the N-terminus of the protein. As a result, the yield of the protein will be low. Please see Technologies to improve protein yield caused by rare codons for more information.
Rare codons within the first 20 amino acids, and sometimes within the first 50 amino acids, appear to affect protein yield significantly. Rare codons after the first 50 amino acids do not have significant effects on protein yield. However, clusters of the same rare codon can pause or stop translation even when the clusters are located after the first 50 amino acids. These clusters of the same rare codon can cause premature translation termination. The resulting truncated protein may not be correctly folded. Incorrectly folded soluble protein is not stable and is susceptible to protein degradation. Careful examination of clusters of the same rare codon is important for protein yield.
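The two checks described above, rare codons near the N-terminus and clusters of the same rare codon anywhere, can be sketched as a simple scan. The rare codon set below is an illustrative assumption, not a list given in the text, and a "cluster" is taken here to mean two or more consecutive copies of the same rare codon. All names are hypothetical.

```python
# Illustrative set of codons often described as rare in E.coli (assumed).
RARE_CODONS = {"AGG", "AGA", "CGA", "CTA", "ATA", "CCC"}


def split_codons(cds):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    seq = cds.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rare_codon_report(cds, n_terminal=50, min_cluster=2):
    """Return (early, clusters).

    early:    (position, codon) for rare codons in the first n_terminal codons
    clusters: (start position, codon, run length) for consecutive runs of the
              same rare codon, wherever they occur. Positions are 1-based.
    """
    codons = split_codons(cds)
    early = [(i + 1, c) for i, c in enumerate(codons[:n_terminal])
             if c in RARE_CODONS]
    clusters = []
    i = 0
    while i < len(codons):
        j = i
        while j < len(codons) and codons[j] == codons[i]:
            j += 1                      # extend the run of identical codons
        if codons[i] in RARE_CODONS and j - i >= min_cluster:
            clusters.append((i + 1, codons[i], j - i))
        i = j
    return early, clusters


cds = "ATG" + "AGG" + "GCT" * 60 + "AGAAGAAGA" + "TAA"   # hypothetical
print(rare_codon_report(cds))
```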
The structural proteins of bacteriophages are highly expressed in E.coli hosts. Many recombinant proteins are also highly expressed when the first 5 to 10 amino acids from structural proteins of bacteriophages are added to the N-termini of the proteins. Some N-termini of the fusion partners are engineered to contain these amino acids. This is why the expression levels of the fusion recombinant proteins are high.
Recent studies indicate that the protein coding sequence itself is important for protein yield. Eighteen amino acids and the translation stop signal are encoded by multiple codons. A single protein may therefore be represented by a large number of coding sequences that use different codons to encode the same amino acids. Protein expression levels of these different coding sequences can differ by as much as 250-fold. It was observed that coding sequences also affect mRNA level, mRNA degradation, and the host cell growth rate. It was concluded that codon bias was not responsible for the expression variation. Instead, the stability of mRNA folding near the ribosome binding site, and the associated rate of translation initiation, play the dominant role in determining protein expression level. It appears that the protein expression levels of different coding sequences must be determined empirically. There are no general rules for choosing a coding sequence that will lead to a high expression level. Pharmaceutically important proteins may justify the resources needed to test a large number of coding sequences.
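To give a sense of how large the space of synonymous coding sequences is, the count for a given protein is the product of the number of codons available for each residue in the standard genetic code, times three if the stop codon is also allowed to vary. The sketch below is illustrative; the function name and the example peptide are hypothetical.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code (sums to 61).
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2,
    "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}
STOP_CHOICES = 3  # UAA, UAG, UGA


def synonymous_sequence_count(protein, include_stop=True):
    """Number of distinct coding sequences that encode `protein`
    (one-letter amino acid codes)."""
    count = prod(DEGENERACY[aa] for aa in protein.upper())
    return count * STOP_CHOICES if include_stop else count


# A hypothetical 10-residue peptide already has a large number of encodings.
print(synonymous_sequence_count("MLSRAGPTVK"))
```

Even for short peptides the count grows quickly, which is consistent with the point that only tens or hundreds of candidate sequences can realistically be tested.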
Other factors and protein yield
Under most conditions, protein yield is proportional to the cell mass of the host cells. The cell mass is equal to the cell density times the culture volume. Using a larger volume and increasing the cell density are the most common ways to increase protein yield. The culture volume is limited by laboratory or production settings. The cell density is related to the culture conditions and the growth medium. Changing the culture condition from shake flask to fermentation can increase cell density 10 times or more. A fed-batch fermentor can reach a cell density of OD600 = 30 to 50 for most E.coli strains. E.coli cells can reach a cell density of up to OD600 = 200 to 250 under optimized fermentation conditions. Protein yield may thereby be increased from milligrams to grams per liter.
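The proportionality between yield and cell mass can be made concrete with a small calculation. The sketch below assumes that the amount of protein produced per unit of OD600 per liter stays constant as density increases, which is a simplification; the parameter `mg_per_liter_per_od` and its value are hypothetical.

```python
def total_yield_mg(od600, culture_liters, mg_per_liter_per_od):
    """Estimated total protein in mg, assuming yield scales linearly with
    cell mass (cell density x culture volume).

    mg_per_liter_per_od is a hypothetical per-protein constant.
    """
    if min(od600, culture_liters, mg_per_liter_per_od) < 0:
        raise ValueError("inputs must be non-negative")
    return od600 * culture_liters * mg_per_liter_per_od


flask = total_yield_mg(od600=3, culture_liters=1, mg_per_liter_per_od=10)
fermentor = total_yield_mg(od600=30, culture_liters=1, mg_per_liter_per_od=10)
print(flask, fermentor, fermentor / flask)
```

With the same per-cell productivity, moving from OD600 = 3 to OD600 = 30 in the same volume gives a tenfold increase in this simple model.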
In addition to the culture condition, the growth medium can also increase cell density and therefore protein yield. High density growth media support high density E.coli growth. In a shake flask under normal aeration conditions, common media such as LB can support E.coli growth up to a cell density of OD600 = 2 to 3. Richer media such as TB can grow E.coli to OD600 = 5 to 8. By contrast, all of our proprietary high density bacterial growth media can grow E.coli to a cell density of OD600 = 30 to 50. This is over ten times higher than LB and over five times higher than TB. As a result, 5 to 10 times more plasmid DNA or protein can be produced in our high density growth media.
Ampicillin is the most used and therefore best studied antibiotic. The expression of the ampicillin resistance gene, or β-lactamase protein, does not per se appear to have a significant impact on protein yield. However, loss of selection is one of the major reasons for low protein yield with the ampicillin selection marker. Ampicillin may be degraded chemically under acidic conditions in the medium, or by β-lactamase. At high cell density with insufficient aeration, the culture medium may reach pH 4 or lower, and ampicillin will be chemically degraded at this pH. Loss of antibiotic in the medium will result in the growth of cells without expression vectors and will lower protein yield. The ampicillin analog carbenicillin is more stable at acidic pH. Using carbenicillin in place of ampicillin, providing additional ampicillin at induction, or increasing the pH will improve the selection and the protein yield. E.coli cells appear to tolerate ampicillin over a large range of concentrations. The commonly used ampicillin concentration is 50 to 200 ug/ml of medium under shake flask conditions.
Chloramphenicol is the second most used antibiotic. Its selection marker is often used on a plasmid that is co-expressed with an ampicillin selection plasmid. E.coli cells also tolerate chloramphenicol over a large range of concentrations. Chloramphenicol concentrations of 30 to 150 ug/ml may be used in a shake flask. Chloramphenicol does not appear to be as easily degraded as ampicillin.
Kanamycin selection marker is also often used on the plasmids that co-express with ampicillin plasmids. E.coli cells do not have high tolerance of kanamycin and tetracycline. Kanamycin and tetracycline are used at 30 to 50 ug/ml and 10 to 20 ug/ml respectively.
The percentage of cells containing plasmid may be tested by growing cells on plates with and without antibiotic. The ratio of the number of cells growing on the antibiotic plate to the number growing on the plate without antibiotic is the percentage of cells containing plasmid. Increasing the percentage of cells containing the plasmid will increase protein yield. Aeration conditions, medium pH, growth temperature, and the addition of extra antibiotic will all affect this percentage.
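The plate-count ratio described above reduces to a one-line calculation. The sketch below assumes both plates were spread with the same dilution and volume of culture; the function name and the colony counts are hypothetical.

```python
def plasmid_retention_percent(colonies_with_antibiotic,
                              colonies_without_antibiotic):
    """Percentage of cells that still carry the plasmid, from colony counts
    on plates with and without antibiotic (same dilution and volume plated).
    """
    if colonies_without_antibiotic <= 0:
        raise ValueError("need at least one colony on the antibiotic-free plate")
    if colonies_with_antibiotic < 0:
        raise ValueError("colony counts cannot be negative")
    return 100.0 * colonies_with_antibiotic / colonies_without_antibiotic


# Hypothetical counts: 45 colonies with antibiotic, 150 without.
print(plasmid_retention_percent(45, 150))
```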
The degradation of protein is termed proteolysis. Proteolysis can be reduced by using protease-deficient host cell strains. Expressing the protein in a different cellular compartment may also reduce proteolysis. Some amino acid sequences may be involved in proteolysis. For example, proteins in which the start Met is followed by Arg, Lys, Phe, Leu, Tyr, or Trp are more susceptible to degradation than those with other amino acids at this position. In eukaryotes, PEST sequences (rich in Pro, Glu, Ser, and Thr) are involved in proteolysis, but they are not important in prokaryotes.
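The observation about the residue following the start Met can be turned into a simple screen. The sketch below only encodes the six residues listed in the text; the function name is hypothetical and the check is not a full model of protein degradation.

```python
# Residues listed in the text as more degradation-prone after the start Met.
DESTABILIZING_AFTER_MET = set("RKFLYW")


def second_residue_is_destabilizing(protein):
    """True if the residue following the start Met is Arg, Lys, Phe, Leu,
    Tyr, or Trp (one-letter codes)."""
    seq = protein.upper()
    if len(seq) < 2 or seq[0] != "M":
        raise ValueError("expected a sequence starting with Met plus one residue")
    return seq[1] in DESTABILIZING_AFTER_MET


print(second_residue_is_destabilizing("MRGSHHHHHH"))  # hypothetical N-terminus
print(second_residue_is_destabilizing("MASMTGGQQM"))  # hypothetical N-terminus
```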
The most important factor for protein stability in expression is protein folding. Incorrectly folded protein is subject to degradation, which results in low yield. Most protein degradation observed during and after protein expression is the result of expressing truncated protein domains. In many cases, the amino acid sequences flanking an intact domain are also required for correct folding. Ten to 20 amino acids flanking the intact domain are generally sufficient for correct folding. Expressing an intact domain of the protein with the necessary flanking amino acid sequences is critical to avoid protein degradation.
A recombinant protein may be fused with an amino acid tag or a fusion protein. A tag is usually less than 50 amino acids in length. A fusion tag may facilitate protein detection and purification. In addition, the amino acid sequences of most fusion tags are optimized for protein yield. A fusion protein is often highly expressed soluble protein. Many fusion proteins will also facilitate protein detection and purification. Expressing a recombinant protein with a fusion tag may increase its protein yield. Expressing a recombinant protein with a fusion protein may increase its yield, solubility, and stability.
Protein toxicity is a commonly observed phenomenon. All active proteins perform certain functions. With few exceptions, these functions are not needed by the host cells, and they therefore interfere with cellular proliferation and differentiation. The observed phenotype of these effects on the host cells is the protein's "toxicity". We estimate that about 80% of all soluble proteins have a certain degree of toxicity to their hosts. About 10% of all proteins are highly toxic to host cells. Toxic proteins tend to slow the cell growth rate, reduce cell density, and in some cases kill the host cells. Protein yields of toxic proteins are lower than those of non-toxic proteins.
Protein solubility is mainly determined by protein folding. At the time of protein synthesis, appropriate amounts of prosthetic groups, co-factors, ligands, other protein subunits, natural partners, and molecular chaperones, as well as the protein's natural environment such as a cell membrane, or their substitutes, must be available for the protein to fold correctly. One or more of these required materials may be depleted at a high production level, resulting in misfolded, insoluble protein. In addition, the cellular protein synthesis machinery seems to have difficulty handling protein folding at a high protein synthesis rate. This is especially true for bacterial and insect expression systems. The bacterial or insect cells may simply pack the highly synthesized protein into inclusion bodies. If the protein expression level is at tens of milligrams per liter in commonly used media such as LB or TB, or at hundreds of milligrams per liter in high density growth media or in a fermentor, increasing the yield further may render some proteins insoluble, although other proteins remain soluble and functional at yields of grams per liter. This creates a dilemma for proteins that become insoluble at high expression levels. For applications in which solubility is not important, the highest expression level may be attempted with available technologies. For proteins whose solubility is critical, the highest yield may only be achieved by using a high density growth medium or a fermentor. Both high density growth media and fermentation increase protein production by increasing cell density. They do not affect the protein synthesis rate and therefore generally will not affect protein solubility.
Technologies to improve protein yield
Developments in DNA synthesis, cloning, and protein expression enable today's scientists to optimize protein yield. For an important protein backed by large resources, almost all of the above-mentioned technologies may be tested. For example, a protein may be expressed in all the different expression systems, from bacteria, yeast, and insect to mammalian cells. Many different cell strains or cell lines can be tested. All sequence elements of the expression vector, from promoter and terminator to regulatory genes, can be optimized. Tens or hundreds of different complete coding sequences of a protein may be synthesized and tested. Some of these optimizations may be contracted out to CRO companies like ours.
Some of the above factors, such as induction time, temperature, and inducer concentration, can be easily optimized. Others may require more molecular biology manipulation. These are all standard techniques and can be performed in most molecular biology labs.
One of the major factors affecting protein yield is protein toxicity. We estimate that less than 20% of low yield is caused by codon usage and over 80% is caused by protein toxicity. We define toxic proteins as proteins that interfere with cell proliferation and differentiation. These proteins normally slow cell growth. Sometimes they cause cell death.
Protein toxicity before induction is the result of leaky protein expression. Transcription read-through from upstream real or cryptic promoters and insufficient transcription repression also result in leaky, or pre-induction, expression.
Some of the above strategies, such as changing the medium or using lacI-expressing strains, can be easily achieved. Others, such as using a different vector, expressing different domains, or expressing with a fusion partner, require more molecular biology manipulation. These are all standard techniques and can be performed in most molecular biology labs.
Many strategies for improving protein yield focus on reducing pre-induction toxicity. The following strategies may be used to decrease post-induction toxicity.
Pre-induction toxicity can be completely eliminated by combination of expression vectors, cell strains and growth media. Post-induction toxicity cannot be completely overcome for certain proteins. With the strategies to reduce post-induction toxicity, sufficient amount of proteins may be expressed and purified.
A recombinant protein may contain codons that are rarely used in the expression system or cells. An insufficient amount of a particular tRNA in the expression system may result in so-called codon starvation. The cellular translation machinery may pause or halt at repetitive rare codons because too little of the corresponding tRNA is available. Multiple repetitive rare codons, especially within the first 50 amino acids of the amino terminus of the protein, may significantly reduce protein expression. Sometimes they shut down expression completely.