5.6: Putting It Together- Important Biological Macromolecules - Biology



Now that we’ve learned about the different macromolecules our bodies need and use, let’s return to the questions we asked at the beginning of the module about healthy diets:

Think about It

  • Is it even possible for a person to cut all carbs out of his or her diet?
  • Is it actually healthy to remove an entire class of molecules from the diet?
  • Fats and cholesterol are strictly bad—right?

See Our Thoughts

It is, in fact, impossible to have a no-carb or no-fat diet. These molecules are present in all cells, and cells make up the food we eat. More importantly, each of these biological macromolecules has a very important role to play. If you cut too much fat from your diet, for example, your fat stores can dip low enough that your hair falls out!

Even much-maligned cholesterol is a requirement for a healthy body and lifestyle: without sufficient cholesterol, your body doesn’t make enough sex hormones (estrogen or testosterone, depending on your sex). The trick is to make healthy choices overall without too much of any one biological macromolecule—after all, there can certainly be too much of a good thing.


This is a page of useful links for undergraduate and postgraduate students of the Industrial Biochemistry Programme at the University of Limerick who are studying bioinformatics (BC4957) or who use bioinformatics as part of their research. The BSc programme in Industrial Biochemistry is run within the Department of Chemical and Environmental Sciences, with related MSc and PhD programmes in molecular microbiology, biotechnology, structural biology, biochemistry, and molecular biology. Others who access the site are more than welcome to use it; I hope it proves useful.

A. Scientific Literature Search and Retrieval Sources, including tutorials: As part of assignment work within the programme, students are required to carry out literature searches, so this page provides useful links to Medline, Web of Science, and Google Scholar. It also links to Science Direct (the University of Limerick’s online Elsevier journal resource, accessible only from within UL). It thus provides a resource for thesis and project work in all areas of Industrial Biochemistry, Molecular Biology, and bioinformatics.

B. A Bioinformatics Gateway with information and links to DNA and Protein Analysis: As part of the undergraduate programme, and during postgraduate work, students will be involved in protein and DNA analysis. This page provides key links to the primary sequence databases and to the major analytical servers such as BLAST, Clustal, and others.

C. Links to research activity within the Pembroke Lab. This includes:

From Structuralism to Functionalism

As structuralism struggled to survive the scrutiny of the scientific method, new approaches to studying the mind were sought. One important alternative was functionalism, founded by William James in the late 19th century and described and discussed in his two-volume publication The Principles of Psychology (1890) (see Chapter 1.2 for details). Building on structuralism’s concern with the anatomy of the mind, functionalism led to greater concern with the functions of the mind, and later to behaviourism.

One of James’s students, James Angell, captured the functionalist perspective in relation to a discussion of free will in his 1906 text Psychology: An Introductory Study of the Structure and Function of Human Consciousness:

Inasmuch as consciousness is a systematising, unifying activity, we find that with increasing maturity our impulses are commonly coordinated with one another more and more perfectly. We thus come to acquire definite and reliable habits of action. Our wills become formed. Such fixation of modes of willing constitutes character. The really good man is not obliged to hesitate about stealing. His moral habits all impel him immediately and irrepressibly away from such actions. If he does hesitate, it is in order to be sure that the suggested act is stealing, not because his character is unstable. From one point of view the development of character is never complete, because experience is constantly presenting new aspects of life to us, and in consequence of this fact we are always engaged in slight reconstructions of our modes of conduct and our attitude toward life. But in a practical common-sense way most of our important habits of reaction become fixed at a fairly early and definite time in life.

Functionalism considers mental life and behaviour in terms of active adaptation to the person’s environment. As such, it provides the general basis for fields such as applied psychology, whose theories are not readily testable by controlled experiments. William James’s functionalist approach to psychology was less concerned with the composition of the mind than with examining the ways in which the mind adapts to changing situations and environments. In functionalism, the brain is believed to have evolved to better the survival of its carrier by acting as an information processor. [1] In processing information, the brain is considered to execute functions similar to those executed by a computer, much like the complex adaptive system shown in Figure 2.3 below.

Figure 2.3 Complex Adaptive System. Behaviour is influenced by information gathered from a changing external environment.

The functionalists retained an emphasis on conscious experience. John Dewey, George Herbert Mead, Harvey A. Carr, and especially James Angell were the additional proponents of functionalism at the University of Chicago. Another group at Columbia University, including James McKeen Cattell, Edward L. Thorndike, and Robert S. Woodworth, shared a functionalist perspective.

Biological psychology is also considered reductionist. For the reductionist, the simple is the source of the complex. In other words, to explain a complex phenomenon (like human behaviour) a person needs to reduce it to its elements. In contrast, for the holist, the whole is more than the sum of the parts. Explanations of a behaviour at its simplest level can be deemed reductionist. The experimental and laboratory approach in various areas of psychology (e.g., behaviourist, biological, cognitive) reflects a reductionist position. This approach inevitably must reduce a complex behaviour to a simple set of variables that offer the possibility of identifying a cause and an effect (i.e., the biological approach suggests that psychological problems can be treated like a disease and are therefore often treatable with drugs).

The brain and its functions (Figure 2.4) garnered great interest from the biological psychologists and continue to be a focus for psychologists today. Cognitive psychologists rely on the functionalist insights in discussing how affect, or emotion, and environment or events interact and result in specific perceptions. Biological psychologists study the human brain in terms of specialized parts, or systems, and their exquisitely complex relationships. Studies have shown neurogenesis [2] in the hippocampus (Gage, 2003). In this respect, the human brain is not a static mass of nervous tissue. As well, it has been found that influential environmental factors operate throughout the life span. Among the most negative factors, traumatic injury and drugs can lead to serious destruction. In contrast, a healthy diet, regular programs of exercise, and challenging mental activities can offer long-term, positive impacts on the brain and psychological development (Kolb, Gibb, & Robinson, 2003).

Figure 2.4 Functions of the Brain. Different parts of the brain are responsible for different things.

The brain comprises four lobes:

  1. Frontal lobe: also known as the motor cortex, this portion of the brain is involved in motor skills, higher-level cognition, and expressive language.
  2. Occipital lobe: also known as the visual cortex, this portion of the brain is involved in interpreting visual stimuli and information.
  3. Parietal lobe: also known as the somatosensory cortex, this portion of the brain is involved in processing tactile sensory information such as pressure, touch, and pain.
  4. Temporal lobe: also known as the auditory cortex, this portion of the brain is involved in interpreting the sounds and language we hear.

Another important part of the nervous system is the peripheral nervous system, which is divided into two parts:

  1. The somatic nervous system, which controls the actions of skeletal muscles.
  2. The autonomic nervous system, which regulates automatic processes such as heart rate, breathing, and blood pressure. The autonomic nervous system, in turn, has two parts:
    1. The sympathetic nervous system, which controls the fight-or-flight response, a reflex that prepares the body to respond to danger in the environment.
    2. The parasympathetic nervous system, which works to bring the body back to its normal state after a fight-or-flight response.

    Research Focus: Internal versus External Focus and Performance

    Within the realm of sport psychology, Gabriele Wulf and colleagues from the University of Nevada, Las Vegas have studied the role of internal and external focus on physical performance outcomes such as balance, accuracy, speed, and endurance. In one experiment they used a ski-simulator and directed participants’ attention to either the pressure they exerted on the wheels of the platform on which they were standing (external focus), or to their feet that were exerting the force (internal focus). On a retention test, the external focus group demonstrated superior learning (i.e., larger movement amplitudes) compared with both the internal focus group and a control group without focus instructions. The researchers went on to replicate these findings in a subsequent experiment that involved balancing on a stabilometer. Again, directing participants’ attention externally, by keeping markers on the balance platform horizontal, led to more effective balance learning than inducing an internal focus, by asking them to try to keep their feet horizontal. The researchers showed that balance performance or learning, as measured by deviations from a balanced position, is enhanced when the performers’ attention is directed to minimizing movements of the platform or disk as compared to those of their feet. Since the initial studies, numerous researchers have replicated the benefits of an external focus for other balance tasks (Wulf, Höß, & Prinz, 1998).

    Another balance task, riding a paddle boat, was used by Totsika and Wulf (2003). With instructions to focus on pushing the pedals forward, participants showed more effective learning compared to participants with instructions to focus on pushing their feet forward. This subtle difference in instructions is important for researchers of attentional focus. The first instruction to push the pedal is external, with the participant focusing on the pedal and allowing the body to figure out how to push the pedal. The second instruction to push the feet forward is internal, with the participant concentrating on making his or her feet move.

    In further biologically oriented psychological research at the University of Toronto, Schmitz, Cheng, and De Rosa (2010) showed that visual attention — the brain’s ability to selectively filter unattended or unwanted information from reaching awareness — diminishes with age, leaving older adults less capable of filtering out distracting or irrelevant information. This age-related “leaky” attentional filter fundamentally impacts the way visual information is encoded into memory, such that older adults with impaired visual attention have better memory for “irrelevant” information. In the study, the research team examined brain images using functional magnetic resonance imaging (fMRI) of a group of young (mean age = 22 years) and older adults (mean age = 77 years) while they looked at pictures of overlapping faces and places (houses and buildings). Participants were asked to pay attention only to the faces and to identify the gender of the person. Even though they could see the place in the image, it was not relevant to the task at hand.

    In young adults, the brain region for processing faces was active while the brain region for processing places was not. However, both the face and place regions were active in older people. This means that even at early stages of perception, older adults were less capable of filtering out the distracting information. Moreover, on a surprise memory test 10 minutes after the scan, older adults were more likely to recognize what face was originally paired with what house.

    The findings suggest that under attentionally demanding conditions, such as a person looking for keys on a cluttered table, age-related problems with “tuning in” to the desired object may be linked to the way in which information is selected and processed in the sensory areas of the brain. Both the relevant sensory information — the keys — and the irrelevant information — the clutter — are perceived and encoded more or less equally. In older adults, these changes in visual attention may broadly influence many of the cognitive deficits typically observed in normal aging, particularly memory.

    Key Takeaways

    • Biological psychology – also known as biopsychology or psychobiology – is the application of the principles of biology to the study of mental processes and behaviour.
    • Biological psychology as a scientific discipline emerged from a variety of scientific and philosophical traditions in the 18th and 19th centuries.
    • In The Principles of Psychology (1890), William James argued that the scientific study of psychology should be grounded in an understanding of biology.
    • The fields of behavioural neuroscience, cognitive neuroscience, and neuropsychology are all subfields of biological psychology.
    • Biological psychologists are interested in measuring biological, physiological, or genetic variables in an attempt to relate them to psychological or behavioural variables.

    Exercises and Critical Thinking

    1. Try this exercise with your group: Take a short walk together without talking to or looking at one another. When you return to the classroom, have each group member write down what they saw, felt, heard, tasted, and smelled. Compare and discuss reflecting on some of the assumptions and beliefs of the structuralists. Consider what might be the reasons for the differences and similarities.
    2. Where can you see evidence of insights from biological psychology in some of the applications of psychology that you commonly experience today (e.g., sport, leadership, marketing, education)?
    3. Study the functions of the brain and reflect on whether you lean toward left-brain or right-brain tendencies.

    Spectroscopy: Principles and Instrumentation

    In this book, you will learn the fundamental principles underpinning molecular spectroscopy and the connections between those principles and the design of spectrophotometers.

    Spectroscopy, along with chromatography, mass spectrometry, and electrochemistry, is an important and widely used analytical technique. Applications of spectroscopy include air quality monitoring, compound identification, and the analysis of paintings and culturally important artifacts. This book introduces students to the fundamentals of molecular spectroscopy – including UV-visible, infrared, fluorescence, and Raman spectroscopy – in an approachable and comprehensive way. It goes beyond the basics of the subject and provides a detailed look at the interplay between theory and practice, making it ideal for courses in quantitative analysis, instrumental analysis, and biochemistry, as well as courses focused solely on spectroscopy. It is also a valuable resource for practitioners working in laboratories who regularly perform spectroscopic analyses.

    Spectroscopy: Principles and Instrumentation:

    • Provides extensive coverage of principles, instrumentation, and applications of molecular spectroscopy
    • Facilitates a modular approach to teaching and learning about chemical instrumentation
    • Helps students visualize the effects that electromagnetic radiation in different regions of the spectrum has on matter
    • Connects the fundamental theory of the effects of electromagnetic radiation on matter to the design and use of spectrophotometers
    • Features numerous figures and diagrams to facilitate learning
    • Includes several worked examples and companion exercises throughout each chapter so that readers can check their understanding
    • Offers numerous problems at the end of each chapter to allow readers to apply what they have learned
    • Includes case studies that illustrate how spectroscopy is used in practice, including analyzing works of art, studying the kinetics of enzymatic reactions, detecting explosives, and determining the DNA sequence of the human genome
    • Complements Chromatography: Principles and Instrumentation

    The book is divided into five chapters that cover the Fundamentals of Spectroscopy, UV-visible Spectroscopy, Fluorescence/Luminescence Spectroscopy, Infrared Spectroscopy, and Raman Spectroscopy. Each chapter details the theory upon which the specific techniques are based, provides ways for readers to visualize the molecular-level effects of electromagnetic radiation on matter, describes the design and components of spectrophotometers, discusses applications of each type of spectroscopy, and includes case studies that illustrate specific applications of spectroscopy.

    Each chapter is divided into multiple sections using headings and subheadings, making it easy for readers to work through the book and to find specific information relevant to their interests. Numerous figures, exercises, worked examples, and end-of-chapter problems reinforce important concepts and facilitate learning.

    Spectroscopy: Principles and Instrumentation is an excellent text that prepares undergraduate students and practitioners to operate in modern laboratories.

    10.4 k-means clustering

    10.4.1 Background

    k-means clustering is a classic technique that aims to partition cells into k clusters. Each cell is assigned to the cluster with the closest centroid, which is done by minimizing the within-cluster sum of squares using a random starting configuration for the k centroids. The main advantage of this approach lies in its speed, given the simplicity and ease of implementation of the algorithm. However, it suffers from a number of serious shortcomings that reduce its appeal for obtaining interpretable clusters:

    • It implicitly favors spherical clusters of equal radius. This can lead to unintuitive partitionings on real datasets that contain groupings with irregular sizes and shapes.
    • The number of clusters k must be specified beforehand and represents a hard cap on the resolution of the clustering. For example, setting k below the number of cell types will always lead to co-clustering of two cell types, regardless of how well separated they are. In contrast, other methods like graph-based clustering will respect strong separation even if the relevant resolution parameter is set to a low value.
    • It is dependent on the randomly chosen initial coordinates. This requires multiple runs to verify that the clustering is stable.

    That said, k-means clustering is still one of the best approaches for sample-based data compression. In this application, we set k to a large value, such as the square root of the number of cells, to obtain fine-grained clusters. These are not meant to be interpreted directly; rather, the centroids are treated as “samples” for further analyses. The idea is to obtain a single representative of each region of the expression space, reducing the number of samples and the computational work in later steps such as trajectory reconstruction (Ji and Ji 2016). This approach also eliminates differences in cell density across the expression space, ensuring that the most abundant cell type does not dominate downstream results.
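The book's own examples use base R's kmeans(), which was stripped from this excerpt. As a language-neutral illustration of the assign-and-update iteration described above, here is a minimal pure-Python sketch of Lloyd's algorithm; the function name and toy data are our own, not from the book.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its members,
    until the centroids stop moving. This is the iterative minimization of
    the within-cluster sum of squares described in the text."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))  # random starting configuration
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
    return centroids, labels
```

Note the dependence on the random starting configuration: in practice (as the text warns) one runs the algorithm several times, or with several seeds, to verify that the clustering is stable.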

    10.4.2 Base implementation

    Base R provides the kmeans() function that does as its name suggests. We call this on our top PCs to obtain a clustering for a specified number of clusters in the centers= argument, after setting the random seed to ensure that the results are reproducible. In general, the k-means clusters correspond to the visual clusters on the t-SNE plot in Figure 10.6, though there are some divergences that are not observed in, say, Figure 10.1. (This is at least partially due to the fact that t-SNE is itself graph-based and so will naturally agree more with a graph-based clustering strategy.)

    Figure 10.6: t-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from k-means clustering.

    If we were so inclined, we could obtain a “reasonable” choice of k by computing the gap statistic using methods from the cluster package. This is the log-ratio of the expected to observed within-cluster sum of squares, where the expected value is computed by randomly distributing cells within the minimum bounding box of the original data. A larger gap statistic represents a lower observed sum of squares - and thus better clustering - compared to a population with no structure. Ideally, we would choose the k that maximizes the gap statistic, but this is often unhelpful as the tendency of k-means to favor spherical clusters drives a large k to capture different cluster shapes. Instead, we choose the most parsimonious k beyond which the increases in the gap statistic are considered insignificant (Figure 10.7). It must be said, though, that this process is time-consuming and the resulting choice of k is not always stable.

    Figure 10.7: Gap statistic with respect to increasing number of k-means clusters in the 10X PBMC dataset. The red line represents the chosen k.
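The book computes the gap statistic with R's cluster package; the pure-Python sketch below (helper names are ours) makes the estimated quantity concrete: the average log WCSS of uniform reference data drawn from the data's bounding box, minus the log of the observed WCSS.

```python
import math
import random

def _kmeans_labels(points, k, rng, n_iter=50):
    # Compact Lloyd's algorithm; returns the cluster label of each point.
    cents = list(rng.sample(points, k))
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, cents[j])) for p in points]
        new = []
        for j in range(k):
            mem = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(m[d] for m in mem) / len(mem)
                             for d in range(len(mem[0]))) if mem else cents[j])
        if new == cents:
            break
        cents = new
    return [min(range(k), key=lambda j: math.dist(p, cents[j])) for p in points]

def _wcss(points, labels):
    # Within-cluster sum of squares around each cluster's centroid.
    total = 0.0
    for j in set(labels):
        mem = [p for p, l in zip(points, labels) if l == j]
        cen = [sum(m[d] for m in mem) / len(mem) for d in range(len(mem[0]))]
        total += sum(math.dist(p, cen) ** 2 for p in mem)
    return total

def gap_statistic(points, k, n_ref=10, seed=0):
    """log(E[WCSS of uniform reference data]) - log(observed WCSS), where
    reference points are drawn uniformly from the data's bounding box."""
    rng = random.Random(seed)
    obs = _wcss(points, _kmeans_labels(points, k, rng))
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    ref_logs = []
    for _ in range(n_ref):
        ref = [tuple(rng.uniform(lo[d], hi[d]) for d in range(dim))
               for _ in points]
        ref_logs.append(math.log(_wcss(ref, _kmeans_labels(ref, k, rng))))
    return sum(ref_logs) / len(ref_logs) - math.log(obs)
```

A larger value means the observed clustering is much tighter than chance, which is why well-separated data gives a higher gap statistic at the "true" k than at smaller values.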

    A more practical use of k-means is to deliberately set k to a large value to achieve overclustering. This will forcibly partition cells inside broad clusters that do not have well-defined internal structure. For example, we might be interested in the change in expression from one “side” of a cluster to the other, but the lack of any clear separation within the cluster makes it difficult to separate with graph-based methods, even at the highest resolution. k-means has no such problems and will readily split these broad clusters for greater resolution, though obviously one must be prepared for the additional work involved in interpreting a greater number of clusters.

    Figure 10.8: t-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from k-means clustering with k = 20.

    As an aside: if we were already using clusterRows() from bluster, we can easily switch to k-means clustering by supplying a KmeansParam() as the second argument. This requires the number of clusters as a fixed integer or as a function of the number of cells - the example below sets the number of clusters to the square root of the number of cells, which is an effective rule of thumb for vector quantization.
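The square-root rule of thumb is simple enough to state as a one-line helper; n_centroids is a hypothetical name (in the book this would be a function passed to KmeansParam() in R), shown here in Python purely for illustration.

```python
import math

def n_centroids(n_cells):
    """Square-root rule of thumb for vector quantization: the number of
    k-means centroids grows as the square root of the number of cells."""
    return max(1, round(math.sqrt(n_cells)))
```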

    10.4.3 Assessing cluster separation

    The within-cluster sum of squares (WCSS) for each cluster is the most relevant diagnostic for k-means, given that the algorithm aims to find a clustering that minimizes the WCSS. Specifically, we use the WCSS to compute the root-mean-squared deviation (RMSD) that represents the spread of cells within each cluster. A cluster is more likely to have a low RMSD if it has no internal structure and is separated from other clusters (such that there are not many cells on the boundaries between clusters, which would result in a higher sum of squares from the centroid).
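The RMSD of a cluster is just the square root of its WCSS divided by its size. A small pure-Python sketch of this diagnostic (the helper name is ours; the book computes it from the kmeans() output in R):

```python
import math

def cluster_rmsd(points, labels):
    """Per-cluster root-mean-squared deviation from the centroid,
    i.e. sqrt(WCSS_j / n_j) - the spread diagnostic described in the text."""
    out = {}
    for j in set(labels):
        members = [p for p, l in zip(points, labels) if l == j]
        dim = len(members[0])
        cen = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        ss = sum(math.dist(p, cen) ** 2 for p in members)  # cluster's WCSS
        out[j] = math.sqrt(ss / len(members))
    return out
```

A tight cluster yields a small RMSD, while a diffuse cluster (or one absorbing boundary cells) yields a large one.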

    (As an aside, the RMSDs of the clusters are poorly correlated with their sizes in Figure 10.8. This highlights the risks of attempting to quantitatively interpret the sizes of visual clusters in t-SNE plots.)

    To explore the relationships between k-means clusters, a natural approach is to compute distances between their centroids. This directly lends itself to visualization as a tree after hierarchical clustering (Figure 10.9).

    Figure 10.9: Hierarchy of k-means cluster centroids, using Ward’s minimum variance method.
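The input to that hierarchical clustering is simply the pairwise distance matrix between centroids. A minimal sketch of computing it (the function name is ours; the book would pass such distances to Ward's method in R):

```python
import math

def centroid_distances(centroids):
    """Pairwise Euclidean distances between cluster centroids; this matrix
    is what a hierarchical clustering (e.g. Ward's method, as in the text)
    consumes to build the centroid tree."""
    n = len(centroids)
    return [[math.dist(centroids[i], centroids[j]) for j in range(n)]
            for i in range(n)]
```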

    10.4.4 In two-step procedures

    As previously mentioned, k-means is most effective in its role of vector quantization, i.e., compressing adjacent cells into a single representative point. This allows k-means to be used as a prelude to more sophisticated and interpretable - but expensive - clustering algorithms. The clusterRows() function supports a “two-step” mode where k-means is initially used to obtain representative centroids that are subjected to graph-based clustering. Each cell is then placed in the same graph-based cluster that its k-means centroid was assigned to (Figure 10.10).

    Figure 10.10: t-SNE plot of the PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from combined k-means/graph-based clustering.
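The two-step idea can be sketched end to end in pure Python. Note the loud caveat: the second stage below uses a crude distance-threshold merge of centroids as a stand-in for the graph-based clustering that bluster's clusterRows() actually performs; all names and parameters here are our own.

```python
import math
import random

def _kmeans(points, k, rng, n_iter=50):
    # Compact Lloyd's algorithm; returns centroids and per-point labels.
    cents = list(rng.sample(points, k))
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, cents[j])) for p in points]
        new = []
        for j in range(k):
            mem = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(m[d] for m in mem) / len(mem)
                             for d in range(len(mem[0]))) if mem else cents[j])
        if new == cents:
            break
        cents = new
    labels = [min(range(k), key=lambda j: math.dist(p, cents[j])) for p in points]
    return cents, labels

def two_step_cluster(points, k_quant, merge_dist, seed=0):
    """Step 1: quantize cells into k_quant k-means centroids. Step 2: cluster
    the centroids (a simple distance-threshold merge stands in for the
    graph-based step). Each cell inherits the cluster of its centroid."""
    rng = random.Random(seed)
    cents, cell2cent = _kmeans(points, k_quant, rng)
    # Union-find over centroids: merge any pair closer than merge_dist.
    parent = list(range(len(cents)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(cents)):
        for j in range(i + 1, len(cents)):
            if math.dist(cents[i], cents[j]) < merge_dist:
                parent[find(i)] = find(j)
    # Map each cell to a compact label for its centroid's merged group.
    group_ids = {}
    return [group_ids.setdefault(find(c), len(group_ids)) for c in cell2cent]
```

The expensive second stage only ever sees k_quant centroids rather than every cell, which is exactly the speed benefit described below.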

    The obvious benefit of this approach over direct graph-based clustering is the speed improvement. We avoid the need to identify nearest neighbors for each cell and to construct a large intermediate graph, while benefiting from the relative interpretability of graph-based clusters compared to those from k-means. This approach also mitigates the “inflation” effect discussed in Section 10.3. Each centroid serves as a representative of a region of space that is roughly similar in volume, ameliorating differences in cell density that can cause (potentially undesirable) differences in resolution.

    The choice of the number of k-means clusters (defined here by the kmeans.clusters= argument) determines the trade-off between speed and fidelity. Larger values provide a more faithful representation of the underlying distribution of cells, at the cost of requiring more computational work in the second-stage clustering procedure. Note that the second step operates on the centroids, so increasing kmeans.clusters= may have further implications if the second-stage procedure is sensitive to the total number of input observations. For example, increasing the number of centroids would require a concomitant increase in k= (the number of neighbors in graph construction) to maintain the same level of resolution in the final output.



    A Brief History

    Shortly after KSHV was discovered in KS lesions and primary effusion lymphoma (PEL) cells had been identified as a source of KS virus, reports described characteristic nuclear speckles observed by immunofluorescence when staining PEL cells with sera from patients who were PCR-positive for KSHV (2–4). Soon after, cloning and sequencing of the complete KSHV genome and identification of the major KSHV latency-associated genes, in combination with transfection experiments, revealed that LANA, encoded by ORF73, is the antigen that reacts with KSHV-positive patient antisera to give rise to “LANA speckles” (5, 6). To date, detection of LANA speckles is the gold standard for KSHV diagnostics (7). LANA is a large 220- to 240-kDa nuclear protein that interacts with many host cellular proteins involved in DNA replication and transcriptional regulation (8). For this discussion, we focus on the role of LANA with respect to genome persistence during latency. The first evidence that KSHV LANA, like EBNA1 from the related human tumor virus Epstein–Barr virus (EBV), is responsible for genome segregation came in 1999, when it was demonstrated that plasmids containing TR sequences were stably segregated in cells expressing LANA (9, 10). Multiple groups identified the TR sequences as cis-regulatory elements essential for both the initiation of DNA replication and the segregation of TR-containing plasmids during mitosis. The LANA C-terminal domain was mapped and shown to bind two LANA binding sites (LBS1 and LBS2) in a cooperative manner (11). Next, elegant structural and genetic approaches demonstrated that an 18-aa-long N-terminal peptide specifically interacts with the H2A/H2B histone interface, and that this interaction is required for episomal segregation (12).
The model that arose from these molecular studies is that LANA binds to the viral TR sequences via its C-terminal DNA binding domain in a highly sequence-specific manner, while tethering viral episomes to host chromatin through interaction of the LANA N-terminal domain with histones. In other words, LANA forms a “tether” or “bridge” between viral and host chromatin. As described above, many molecular details are now known.

    The Classroom Flow: Evolution Doodling Activity

    1. As students enter the classroom, hand out highlighters/markers and 5-6 sheets of paper. Tell students they have 2 minutes to copy the animal drawing I have posted on the front whiteboard.

    • Note: The organism you choose to draw and post each class period can vary and is not important so long as it is something the students can easily draw, such as a fish.

    2. After the timer goes off, ask students to hold up their fish drawings. Walk around the room and choose one from the group with as much dramatic pause as they can stand! Don't tell students what criteria you are using to make your choice.

    • Note: Expect lots of giggling and murmurs from the room as you walk around! My reflections that follow provide some options for you to choose from or build upon.

    3. After choosing a drawing from the student group, take the first drawing down from the whiteboard and post the one you collected from the group. Ask students to copy this drawing as best they can in 2 minutes.

    4. Repeat this process for 5-6 generations depending upon the time and conversational cues of the group.

    • Note: You will hear all kinds of spontaneous conversations concerning how fish are being chosen for the next round of drawings--I try not to confirm or comment until after our activity is over.

    5. Once you have accumulated 5-6 drawings, post them up on the board in chronological order for students to view and analyze as a group. At this point, your engaged class discussion can begin.

    6.5 Removing low-quality cells

    Once low-quality cells have been identified, we can choose to either remove them or mark them. Removal is the most straightforward option and is achieved by subsetting the SingleCellExperiment by column. In this case, we use the low-quality calls from Section to generate a subsetted SingleCellExperiment that we would use for downstream analyses.
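The book performs this subsetting in R on a SingleCellExperiment object; as a language-neutral sketch of the same idea, here is the column subsetting expressed in Python on a plain genes-by-cells count matrix (all variable names here are hypothetical):

```python
# Hypothetical sketch of removing flagged cells by column subsetting.
# counts: genes-by-cells matrix; discard: one boolean QC call per cell (column).
counts = [
    [5, 0, 12, 3],   # gene A across 4 cells
    [2, 1,  9, 0],   # gene B across the same 4 cells
]
discard = [False, True, False, False]  # e.g. low-quality calls from QC

# Keep only the columns (cells) whose discard flag is False.
filtered = [[row[j] for j, d in enumerate(discard) if not d] for row in counts]

print(filtered)  # each gene now has 3 retained cells
```

The retained matrix is what would be carried forward into downstream analyses.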

    The biggest practical concern during QC is whether an entire cell type is inadvertently discarded. There is always some risk of this occurring as the QC metrics are never fully independent of biological state. We can diagnose cell type loss by looking for systematic differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B data set, and we compute the log-fold change between the pool averages.
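The pool-average comparison above reduces to a simple computation. A minimal Python sketch (toy numbers, hypothetical gene names; the book does this in R on real counts) averages each gene over the discarded and retained cells and takes the log2 ratio of the pool averages, with a pseudo-count to avoid dividing by zero:

```python
import math

# Toy data: per-gene counts in retained vs. discarded cells (hypothetical).
retained = {"GeneA": [10, 12, 8], "GeneB": [0, 1, 0]}
discarded = {"GeneA": [9, 11],    "GeneB": [40, 36]}

# Log2 fold change of pool averages, with a pseudo-count of 1.
logfc = {}
for gene in retained:
    mean_kept = sum(retained[gene]) / len(retained[gene])
    mean_lost = sum(discarded[gene]) / len(discarded[gene])
    logfc[gene] = math.log2((mean_lost + 1) / (mean_kept + 1))

# GeneB is much higher in the discarded pool, hinting at a lost cell type.
```

A large positive log-fold change flags genes enriched in the discarded pool, which is the signature of an inadvertently removed cell type.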

    If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 6.5, suggesting that the QC step did not inadvertently filter out a cell type in the 416B dataset.

    Figure 6.5: Log-fold change in expression in the discarded cells compared to the retained cells in the 416B dataset. Each point represents a gene with mitochondrial transcripts in blue.

    For comparison, let us consider the QC step for the PBMC dataset from 10X Genomics (Zheng et al. 2017). Rather than using any outlier-based method, we apply an arbitrary fixed threshold on the library size to filter cells. Specifically, we remove all libraries with a library size below 500.
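The fixed threshold described here is a one-line rule; a minimal Python sketch (toy library sizes, not the actual PBMC data) makes the behavior explicit:

```python
# Hypothetical sketch of a fixed library-size filter: discard any cell whose
# total count (library size) falls below 500, as in the text.
library_sizes = [2300, 480, 1500, 120, 900]

discard = [size < 500 for size in library_sizes]
n_removed = sum(discard)
print(n_removed)  # 2 cells fall below the threshold
```

Unlike an outlier-based rule, this threshold ignores the distribution of the data, which is exactly why it can sweep up an entire small-library cell type such as platelets.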

    The presence of a distinct population in the discarded pool manifests in Figure 6.6 as a set of genes that are strongly upregulated in the lost pool. This includes PF4, PPBP and SDPR, which (spoiler alert!) indicates that a platelet population has been discarded by the alt.discard filter.

    Figure 6.6: Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange.

    If we suspect that cell types have been incorrectly discarded by our QC procedure, the most direct solution is to relax the QC filters for metrics that are associated with genuine biological differences. For example, outlier detection can be relaxed by increasing nmads= in the isOutlier() calls. Of course, this increases the risk of retaining more low-quality cells and encountering the problems discussed in Section 6.1. The logical endpoint of this line of reasoning is to avoid filtering altogether, as discussed in Section 6.6.
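The nmads= argument controls how many median absolute deviations from the median a value must be before it is flagged. As a rough stand-alone sketch of that MAD-based rule (not the actual scater implementation; toy numbers throughout), increasing nmads visibly relaxes the filter:

```python
import statistics

def is_outlier_low(values, nmads=3):
    """Flag values more than `nmads` MADs below the median.
    Rough sketch of the lower-tail MAD rule, not the scater code."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v < med - nmads * mad for v in values]

sizes = [1000, 1100, 950, 1050, 800]   # one moderately small library
strict = is_outlier_low(sizes, nmads=3)   # default stringency flags it
relaxed = is_outlier_low(sizes, nmads=6)  # larger nmads retains it
```

With the stricter setting the 800-count library is flagged; doubling nmads keeps it, illustrating the trade-off between losing cell types and retaining low-quality cells.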

    As an aside, it is worth mentioning that the true technical quality of a cell may also be correlated with its type. (This differs from a correlation between the cell type and the QC metrics, as the latter are our imperfect proxies for quality.) This can arise if some cell types are not amenable to dissociation or microfluidics handling during the scRNA-seq protocol. In such cases, it is possible to “correctly” discard an entire cell type during QC if all of its cells are damaged. Indeed, concerns over the computational removal of cell types during QC are probably minor compared to losses in the experimental protocol.

    11.6 Further comments

    One consequence of the DE analysis strategy is that markers are defined relative to subpopulations in the same dataset. Biologically meaningful genes will not be detected if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. In practice, this is usually only a problem when the experimental data are provided without any biological context - certainly, we would hope to have some a priori idea about what cells have been captured. For most applications, it is actually desirable to avoid detecting such genes as we are interested in characterizing heterogeneity within the context of a known cell population. Continuing from the example above, the failure to detect T cell markers is of little consequence if we already know we are working with T cells. Nonetheless, if “absolute” identification of cell types is necessary, we discuss some strategies for doing so in Chapter 12.

    Alternatively, marker detection can be performed by treating gene expression as a predictor variable for cluster assignment. For a pair of clusters, we can find genes that discriminate between them by performing inference with a logistic model where the outcome for each cell is whether it was assigned to the first cluster and the lone predictor is the expression of each gene. Treating the cluster assignment as the dependent variable is more philosophically pleasing in some sense, as the clusters are indeed defined from the expression data rather than being known in advance. (Note that this does not solve the data snooping problem.) In practice, this approach effectively does the same task as a Wilcoxon rank sum test in terms of quantifying separation between clusters. Logistic models have the advantage that they can easily be extended to block on multiple nuisance variables, though this is not typically necessary in most use cases. Even more complex strategies use machine learning methods to determine which features contribute most to successful cluster classification, but this is probably unnecessary for routine analyses.
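The logistic-model view of marker detection can be sketched on toy data. The example below (hypothetical expression values; plain gradient descent rather than a proper GLM fit, and no nuisance covariates) regresses cluster-1 membership on one gene's expression; a positive fitted slope means higher expression of that gene predicts assignment to cluster 1:

```python
import math

# Toy data: expression of one candidate marker gene across six cells,
# and whether each cell was assigned to cluster 1 (all values hypothetical).
expr = [0.1, 0.3, 0.2, 2.1, 1.9, 2.4]
label = [0,   0,   0,   1,   1,   1]

# Fit logistic regression by plain gradient descent on the log-likelihood.
b0, b1 = 0.0, 0.0  # intercept and slope
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in zip(expr, label):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted probability
        g0 += p - y
        g1 += (p - y) * x
    b0 -= 0.1 * g0
    b1 -= 0.1 * g1

# The fitted slope is positive: this gene discriminates the two clusters.
```

In a real analysis one would use a regularized or properly tested GLM per gene, but the structure is the same: cluster assignment as the outcome, expression as the predictor, with nuisance variables added as extra covariates when blocking is needed.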