Current Research

Hyperdimensional Computing

Neuroscience has proven to be a rich source of inspiration for the machine learning community: from the Perceptron, which introduced a simple and general-purpose learning algorithm for linear classifiers, to convolutional architectures inspired by the visual cortex, to sparse coding and independent component analysis. One of the most consequential discoveries from the neuroscience community, which has underlain much research at the intersection of neuroscience and machine learning, is the notion of high-dimensional distributed representations as the fundamental data structure for diverse types of information. In the neuroscience context, these representations are also typically sparse. To give a concrete example, the sensory systems of many organisms have a critical component consisting of a transformation from relatively low-dimensional sensory inputs to much higher-dimensional sparse codes. These latter representations are then used for subsequent tasks such as recall and learning.
HD computing builds on this line of research by using high-dimensional randomized data representations, called hypervectors, as the basic units of computation. Typical values for the dimension of hypervectors are above 5,000. The elements of a hypervector are typically either binary or bipolar (i.e., 0/1 or ±1) or integers. Arbitrary real numbers are generally avoided for computational reasons but are not inherently unsupported. There are two core operations in HD computing: bundling and binding. Bundling is used to compile a collection of related objects into a single representation, while binding is used to form a semantic link between two objects. Because of their high dimensionality, any randomly chosen pair of hypervectors will be nearly orthogonal with high probability. A useful consequence of this is that bundling can be easily implemented as a sum: for a collection of vectors P, Q, V, their element-wise sum S = P + Q + V is, in expectation, closer to P, Q, and V than to any other randomly chosen vector in the space.
Thus, we can represent sets simply by summing the component vectors. Given HD representations of data, this suggests a simple scheme for classification. We take the data points corresponding to a particular class and superimpose them into a single representation for the set. Then, given a new piece of data for which the correct class label is unknown, we compute its similarity with the hypervectors representing each class and return the label corresponding to the most similar one. The process of generating HD representations from low-dimensional data is known as "encoding" and is an active area of research in our group. We have developed novel techniques for encoding continuous data and for generating quantized and sparse representations that are still effective for learning.
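The bundling, binding, and nearest-class lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the group's implementation: the dimension D, the ±1 element type, element-wise multiplication as the binding operator, cosine similarity as the similarity measure, and all function names are assumptions chosen for clarity.

```python
import random

D = 10_000  # assumed dimension; the text cites typical values above 5,000


def random_hv(rng):
    """Draw a random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def bundle(*hvs):
    """Bundling: element-wise sum of the input hypervectors."""
    return [sum(column) for column in zip(*hvs)]


def bind(a, b):
    """Binding: element-wise product (an assumed, commonly used choice)."""
    return [x * y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two hypervectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def classify(query, class_hvs):
    """Return the label whose class hypervector is most similar to the query."""
    return max(class_hvs, key=lambda label: cosine(query, class_hvs[label]))


rng = random.Random(0)
P, Q, V, other = (random_hv(rng) for _ in range(4))
S = bundle(P, Q, V)  # represents the set {P, Q, V}
```

With random bipolar hypervectors, cosine(S, P) comes out near 1/sqrt(3) while cosine(S, other) stays near zero, which is the property that lets a plain sum stand in for a set.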

Advantages of HD Computing
There is a large and rapidly growing body of literature applying HD computing to practical learning problems. In general, this literature has found several desirable properties of algorithms based on the HD computing paradigm:

(1) HD Requires Fewer Training Examples to Learn: Obtaining labeled training data is often costly and time-consuming, requiring tedious annotation by an expert. Thus, it is generally desirable for an algorithm to learn from as few labeled examples as possible. Work from our group found that an HD-based classification algorithm for speech recognition required only 40% as much labeled data to reach the same level of accuracy as a deep neural network. This parallels similar findings from other groups in different applications of HD computing.

(2) HD Representations are Interpretable: State-of-the-art machine learning algorithms like deep neural networks and random forests are often highly complex and difficult for humans to interpret. This reduces user trust in the predictions generated by these algorithms. By contrast, HD algorithms are extremely simple, boiling down to comparing the similarity between the high-dimensional embeddings of the query and the training data. Furthermore, the encoding process used to obtain HD representations is simple and invertible, meaning we can recover the original data from its HD encoding. Recent work from our group explores methods for decoding HD representations, with applications to secure distributed learning.

(3) HD Representations are Robust to Noise: Because HD representations distribute information uniformly over a large number of coordinates, HD computing is highly robust to noise. This property is useful in low-power devices and on emerging architectures that suffer from higher levels of computational noise.
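The noise-robustness property in item (3) can be illustrated with a small experiment: flip the sign of a fraction of a hypervector's coordinates and check that it is still far more similar to the original than to unrelated hypervectors. This is a sketch under assumed choices (bipolar elements, dimension D, a normalized dot product as similarity, and hypothetical helper names), not a model of any specific hardware noise.

```python
import random

D = 10_000  # assumed hypervector dimension


def random_hv(rng):
    """Draw a random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def flip_noise(hv, fraction, rng):
    """Flip the sign of a given fraction of randomly chosen coordinates."""
    noisy = list(hv)
    for i in rng.sample(range(len(hv)), round(fraction * len(hv))):
        noisy[i] = -noisy[i]
    return noisy


def similarity(a, b):
    """Normalized dot product: 1.0 for identical bipolar vectors, about 0 for unrelated ones."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def nearest(query, stored):
    """Return the key of the stored hypervector most similar to the query."""
    return max(stored, key=lambda name: similarity(query, stored[name]))
```

Flipping a fraction f of the coordinates of a bipolar hypervector lowers its similarity to the original to exactly 1 - 2f, so even with 30% of coordinates corrupted the similarity is 0.4, far above the near-zero similarity to unrelated hypervectors.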

Case Study: Using HD Computing for DNA Sequence Alignment
Recent work from our group has leveraged HD computing to accelerate the analysis of genetic data. Sequence alignment is a crucial step in this analysis and can be used to study the genetic determinants of disease and the evolutionary history of organisms, among many other applications. Modern algorithms typically partition one of the sequences into short segments and then search for regions of high similarity in the other "reference" genome. While conceptually simple, this initial process of identifying regions of high similarity between DNA strands remains computationally demanding due to the sheer scale of the data involved: the reference genome for Homo sapiens consists of over 3.2 billion base pairs! Work by our group uses HD computing to accelerate this costly similarity search phase of sequence alignment. Using an optimized FPGA implementation, we obtained a 44.4x speed improvement and 54.1x better energy efficiency than a state-of-the-art non-HD FPGA implementation. Compared with a modern GPU implementation, these numbers were an even more impressive 122x and 707x, respectively.
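To make the idea of HD-based similarity search concrete, the sketch below encodes fixed-length DNA segments as hypervectors and bundles all segments of a reference into one hypervector that can be queried with a single dot product. The text does not describe the group's actual encoding, so everything here is an assumption for illustration: one random bipolar hypervector per base, position encoded by rotation, binding by element-wise product, and the names `encode_segment`, `encode_reference`, and `match_score`.

```python
import random

D = 8192        # assumed hypervector dimension
BASES = "ACGT"


def make_base_hvs(seed=0):
    """Assign one random bipolar hypervector to each base."""
    rng = random.Random(seed)
    return {b: [rng.choice((-1, 1)) for _ in range(D)] for b in BASES}


def rotate(hv, k):
    """Cyclically shift a hypervector by k positions (encodes position)."""
    k %= len(hv)
    return hv[-k:] + hv[:-k]


def encode_segment(segment, base_hvs):
    """Bind the position-rotated base hypervectors of a segment together."""
    out = [1] * D
    for pos, base in enumerate(segment):
        shifted = rotate(base_hvs[base], pos)
        out = [x * y for x, y in zip(out, shifted)]
    return out


def encode_reference(reference, window, base_hvs):
    """Bundle (sum) the hypervectors of every length-`window` segment."""
    total = [0] * D
    for start in range(len(reference) - window + 1):
        seg = encode_segment(reference[start:start + window], base_hvs)
        total = [t + s for t, s in zip(total, seg)]
    return total


def match_score(query, ref_hv, base_hvs):
    """About 1 per exact occurrence of the query in the reference, about 0 otherwise."""
    q = encode_segment(query, base_hvs)
    return sum(x * y for x, y in zip(q, ref_hv)) / D


base_hvs = make_base_hvs()
reference = "ACGTTGCAAGGCTTACGATCGGATCCTAGGCATTAGCCGA"
ref_hv = encode_reference(reference, 8, base_hvs)
```

A query such as "GGCTTACG", which occurs in the toy reference, scores close to 1, while a segment that does not occur scores close to 0. The attraction for hardware is that the search reduces to wide, regular element-wise operations.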

Processing in Memory

We live in a world where technological advances are continually creating more data than we can cope with. With the emergence of the Internet of Things, sensory and embedded devices will generate massive data streams demanding services that pose huge technical challenges due to limited device resources. Even as processor technology evolves to serve computationally complex tasks more efficiently, data movement between processor and memory remains the major performance bottleneck in most applications. We seek to perform hardware/software co-design of novel hybrid processing in-memory platforms that accelerate fundamental operations and diverse data analytics procedures using processing in-memory (PIM) technology. PIM enables in-situ operations, thereby reducing the effective memory bandwidth utilization. In the hardware layer, the proposed platform has a hybrid structure comprising two units: PIM-enabled processors and PIM-based accelerators. The PIM-enabled processors enhance traditional processors by supporting fundamental block-parallel operations (e.g., addition, multiplication, or bitwise computations) inside the processor cache structure and associated memory. To take full advantage of PIM for popular data processing procedures and machine learning algorithms, we also design specialized PIM-based accelerator blocks. To deliver a seamless experience for application developers, we also develop a software infrastructure that provides abstracted interfaces corresponding to the PIM-enabled memory and accelerators.
Our solutions can process several applications entirely in memory, including deep neural network training and inference, graph processing, brain-inspired hyperdimensional computing, multimedia applications, query processing in databases, bioinformatics, and security. The proposed system, which integrates all optimized software components with improved hardware designs, can deliver more than a 10x speedup and a 100-1000x improvement in energy efficiency. Our solutions ensure that the loss of accuracy when running applications with realistic data sets is kept small enough to be imperceptible, thus meeting the user’s quality requirements.
Over the past few years, the group has been actively working to answer the WHY, the WHERE, the WHEN, and the HOW of processing in memory. We are redesigning memory, all the way from systems to architecture down to the low-level circuits, to enable PIM for applications such as machine learning, bioinformatics, data analytics, and graph processing. Recently, we proposed FloatPIM, a highly parallel and flexible architecture that implemented high-precision training and testing of neural networks entirely in memory. The design was flexible enough to support both fixed-point and floating-point operations and provided stable training of complex neural networks. Such PIM-based architectures have shown multiple orders of magnitude of improvement in both performance and energy efficiency.

Bioinformatics Acceleration

Outpacing Moore’s law, genomic data is doubling every seven months and is expected to surpass YouTube and Twitter in data volume by 2025. Starting with sequencing and alignment, bioinformatics pipelines run a variety of algorithms on this big data, from variant calling to classification to graph-based analysis, for different objectives such as understanding disease-causing mutations, personalized treatment, and protein harvesting for drug production, to name a few. The memory, storage, and computation requirements of these applications range from hundreds of CPU hours and gigabytes of memory to millions of CPU hours and petabytes of storage. This tremendous amount of data entails redesigning the entire system stack. The goals are superior architectural solutions for memory and storage systems (e.g., intelligent use of the high-bandwidth memory made available by advances in hardware) and significantly faster computation platforms that expedite decisions and enable real-time clinical use of the technology. One example is shrinking precision microbiome analysis from three months per individual to a few hours through hardware-software co-design that supports and optimizes data-intensive processing. To achieve this, we combine our experience with microbiome algorithms and datasets with innovative hardware design, including processing-in-memory (PIM) acceleration of alignment, clustering, and classification, as well as high-bandwidth FPGAs, GPUs, and near-data accelerators. Together these form a full-stack infrastructure that maps the aforementioned bioinformatics applications to novel hardware in an end-to-end manner. At the same time, we rethink the design and implementation of algorithmic alternatives driven by the new hardware infrastructure, e.g., mapping applications onto the hyperdimensional computing paradigm, which, thanks to its error tolerance, can benefit from novel technologies such as multi-level memory, which is well suited to storing DNA data.

Case Study: Sequence Alignment using PIM
Global sequence alignment can be formulated as finding the optimal set of edit operations (deletions, insertions, and substitutions of base pairs) required to transform sequence x into sequence y. The search space of all possible alignments grows exponentially with the length of the sequences, so evaluating it exhaustively becomes computationally intractable. To resolve this, the Needleman-Wunsch algorithm employs dynamic programming and reduces the worst-case time and space complexity to quadratic. Parallelized versions of Needleman-Wunsch rely on the fact that computing the elements on the same diagonal of the scoring matrix requires only elements of the preceding diagonals. The level of parallelism offered by large sequence lengths cannot be effectively exploited by conventional processor architectures. We proposed RAPID, a ReRAM-based digital PIM architecture to accelerate global sequence alignment. RAPID consists of multiple computational units connected via an H-tree structure, which allows low-latency transfers between adjacent units. Each unit comprises a block for the main scoring sub-matrix and two smaller blocks for backtracking information. The units collectively store the database sequences or reference genome and perform the scoring. For maximum efficiency, RAPID evenly distributes the stored sequence among the units. RAPID takes in a query sequence and outputs details of the required insertions and deletions in the form of traceback information. An iteration of RAPID evaluates one diagonal of the substitution or the alignment matrix. When aligning chromosome 1 of the human genome with that of a chimpanzee (a 477M-diagonal matrix), RAPID is 11.8x faster and 2820x more energy efficient than a cluster of 384 GPUs running the CUDAlign multi-GPU platform.
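The diagonal-by-diagonal evaluation order that makes Needleman-Wunsch parallelizable can be sketched as follows. This is a plain sequential Python illustration of the dependency structure, not RAPID itself; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions. Each cell on anti-diagonal d (where i + j = d) reads only cells on anti-diagonals d - 1 and d - 2, so all cells in the inner loop are independent and could be computed simultaneously.

```python
def nw_score_wavefront(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score, filling the matrix one anti-diagonal at a time."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]

    # Boundary conditions: aligning a prefix against the empty string costs gaps.
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap

    # Sweep anti-diagonals d = i + j. Cells on the same anti-diagonal do not
    # depend on each other, which is what a parallel implementation exploits.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + sub,  # substitution or match (diagonal d - 2)
                H[i - 1][j] + gap,      # deletion (diagonal d - 1)
                H[i][j - 1] + gap,      # insertion (diagonal d - 1)
            )
    return H[n][m]
```

For two sequences of lengths n and m there are n + m - 1 such anti-diagonals, so the number of sequential steps grows linearly with sequence length even though the total work is quadratic.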

Case Study: De novo Assembly using de Bruijn Graphs (DBGs)
DBGs are at the core of so-called de novo assembly, which pieces short DNA reads together to reconstruct the original genome. Processing DBGs is extremely challenging because of the huge data size and, hence, graph size. Algorithms on DBGs require an excessive number of linked-list traversals, pointer-chasing operations, and hash-table lookups, all of which are inefficient on compute-centric systems. In this project, we seek a novel architecture that utilizes emerging in-storage and in-memory processing technologies to accelerate DBG assemblers and tackle the challenges of parallelism and memory throughput. As initial milestones, we investigate state-of-the-art DBG-based methods and identify the critical operations in three processing phases: graph construction, graph cleaning, and sequence assembly. Next, we design a software-hardware solution that effectively utilizes the processing capability available at different levels of the system hierarchy. Specifically, we employ a pre-processing algorithm that recognizes and fixes erroneous short reads to shrink the size of the graph. When a reference genome is available, this step extends our RAPID-based in-memory accelerated alignment to short reads; when no reference genome exists, it uses an in-memory hash-table construction in the subsequent steps. We aim to split and distribute the DBG over multiple memory vaults for algorithmic parallelism, where each sub-graph further benefits from accelerated pruning and traversal operations.
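For readers unfamiliar with de Bruijn graphs, the graph construction and sequence assembly phases can be sketched on a toy example. Nodes are (k-1)-mers, each k-mer in a read contributes one edge, and a contig is read off by following nodes that have a single successor. This is a minimal software illustration with hypothetical function names; it omits graph cleaning and error correction, and real assemblers handle branches, cycles, and reverse complements.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a de Bruijn graph: each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)  # hash table from (k-1)-mer to its successors
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on cycles
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GCGTGC", "GTGCA"]  # overlapping reads of ATGGCGTGCA
graph = build_dbg(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
```

Even this toy version shows the access pattern the text refers to: every step is a hash-table lookup followed by a pointer-like hop to the next node, with little arithmetic in between.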

Case Study: Real-Time Phylogenetic Inference and Transmission Cluster Analysis of COVID-19
The standard viral phylogenetic inference workflow consists of quality checking and filtering, multiple sequence alignment, phylogenetic inference, phylogenetic rooting, phylogenetic dating, and transmission clustering. Researchers have identified multiple sequence alignment and phylogenetic inference as the computational bottlenecks of the workflow, since both scale poorly with the number of input sequences. The objective of this project is to develop a user-friendly, scalable, and modular workflow for conducting real-time computational phylogenetic analysis of assembled viral genomes, with a primary focus on SARS-CoV-2. The project solution includes: (1) the development of a novel software tool for orchestrating the automated end-to-end workflow, (2) the development of novel algorithms (and software implementations of these algorithms) to speed up the computational bottlenecks of the workflow, (3) the development of novel hardware systems for accelerating the workflow, and (4) a real-time, publicly accessible repository in which researchers can access the most up-to-date analysis results (with intermediate files) for all currently available SARS-CoV-2 genomes, to prevent repeated computation. The analysis infrastructure built in this project will be broadly applicable to any viral pathogen for which phylogenetic inference is biologically and epidemiologically meaningful.

IoT Management and Reliability

The Internet of Things is a growing network of heterogeneous devices, combining commercial, industrial, residential, and cloud-fog computing domains. These devices range from low-power sensors with limited capabilities to multi-core platforms at the high end. The common property of these devices is that they age, degrade, and eventually require maintenance in the form of repair, component replacement, or complete device replacement. In general, power dissipation on devices makes the temperature rise, which in turn creates temperature stress that dramatically increases the impact of reliability degradation mechanisms, leading to early failures. To analyze the effects of reliability degradation in IoT networks, we implemented a reliability framework. Using this framework, we are able to explore trade-offs between energy, performance, and reliability. Currently, the framework works with established models for servers, gateways, and edge devices, whose parameters are obtained by fitting to our characterization results.
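The link between temperature and degradation rate can be illustrated with an Arrhenius-type acceleration factor, a widely used form for thermally activated failure mechanisms. The text does not say which models the framework uses, so this is only an assumed stand-in: the function names are hypothetical, and the activation energy of 0.7 eV is a placeholder rather than a fitted parameter.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_c, ref_temp_c, activation_energy_ev=0.7):
    """Factor by which degradation speeds up at temp_c relative to ref_temp_c.

    Assumes an Arrhenius dependence; activation_energy_ev is a placeholder
    that would have to be fitted per device and per failure mechanism.
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref - 1.0 / t)
    return math.exp(exponent)


def relative_lifetime(temp_c, ref_temp_c, activation_energy_ev=0.7):
    """Expected lifetime at temp_c as a fraction of the lifetime at ref_temp_c."""
    return 1.0 / arrhenius_acceleration(temp_c, ref_temp_c, activation_energy_ev)
```

Under these assumed parameters, running a device 20 degrees Celsius hotter than a 45-degree reference shortens its expected lifetime by a factor of roughly four to five, which is why power and thermal management decisions feed directly into maintenance costs.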
We work on "maintenance-preventive" dynamic control strategies for IoT devices to minimize the often-ignored costs of maintenance. Our work has already demonstrated the importance of dynamic reliability management in mobile systems, achieved by controlling the frequency, voltage, and core allocations while respecting user experience constraints. Our current research goal is to extend this to the whole IoT domain. Initially, we showed that the battery health of IoT devices can be improved with reliability-aware network management. We further propose optimal control strategies for diverse devices that adjust their sampling rates, communication rates, or frequency and voltage levels. The distributed solutions work towards limiting maintenance costs while keeping data quality within desired levels. We also develop smart path selection and workload offloading algorithms that ensure a balanced distribution of reliability across the network. The combined approach aims to minimize operational and expected maintenance costs in a distributed and scalable fashion while respecting the user and data quality constraints imposed by the end-to-end IoT applications. In the future, we will validate our approaches on a large-scale sensor network testbed as part of the HPWREN setup.

Trajectories for Persistent Monitoring

Traditionally, environmental phenomena have been measured using stationary sensors configured into wireless sensor networks or through participatory sensing by user-carried devices. Since the phenomena are typically highly correlated in time and space, each reading from a stationary sensor is less informative than that of a similarly capable sensor on the move. User-carried sensors can take more informative readings, but we have no control over where the sensors travel.

Our work in Trajectories for Persistent Monitoring helps to close this gap by optimizing the path that robotic platforms travel to maximize the information gained from data samples. Multi-objective goals are formed using information gain and additional goals, such as system responsiveness to dynamic points of interest, multi-sensor fusion, and information transfer using cognitive radios. The resulting robots can adapt to dynamic environments to rapidly detect evolving wildfires, support first responders in emergency situations, and collect information to improve air quality models for a region.

Past Research
Approximate Computing

Today’s computing systems are designed to deliver only exact solutions at a high energy cost, while many of the algorithms that are run on data are at their heart statistical and thus do not require exact answers. However, approximate computing solutions to date are isolated to only a few of the components in the system stack. The real challenge arises when developers want to employ approximation across multiple layers simultaneously. Many of the potential gains are not realized because there are no system-level solutions.
We develop novel architectures with software and hardware support for approximate computing. Hardware components are enhanced with the ability to dynamically adapt approximation at a quantifiable and controllable cost in terms of accuracy. Software services complement the hardware to ensure that the user’s perception is not compromised while maximizing the energy savings due to approximation. The changes to hardware design include approximation-enabled CPUs and GPUs. The GPU is enhanced with a small associative memory placed close to each stream core. The main idea of the approximation is to return pre-computed results from the associative memory instead of computing accurately on the existing processing units, not only for perfect matches of operands but also for inexact matches. CPUs are designed with a set of small associative memories next to each core. We also presented the nearest-distance associative memory, which on each search returns the pre-computed result whose stored operands are nearest to the input words. The data in associative memory can replace normal CPU execution when a match is close enough. Such inexact matching is subject to a threshold that is set by the software layer.
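The behavior of a nearest-distance associative memory with a software-controlled threshold can be modeled in software as follows. This is a functional sketch only: the class and function names are hypothetical, Hamming distance over the operand bits is an assumed distance metric, and the linear scan stands in for what hardware would do as a single parallel search.

```python
def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


class ApproxAssociativeMemory:
    """Stores (operands -> result) pairs and answers nearest-match queries."""

    def __init__(self, threshold):
        self.threshold = threshold  # max tolerated distance, set by software
        self.entries = []           # list of ((a, b), precomputed result)

    def store(self, a, b, result):
        self.entries.append(((a, b), result))

    def lookup(self, a, b):
        """Return (result, True) if the nearest entry is within the threshold."""
        best, best_dist = None, None
        for (sa, sb), result in self.entries:
            dist = hamming(a, sa) + hamming(b, sb)
            if best_dist is None or dist < best_dist:
                best, best_dist = result, dist
        if best_dist is not None and best_dist <= self.threshold:
            return best, True
        return None, False


def approx_multiply(mem, a, b):
    """Use a stored result when operands are close enough, else compute exactly."""
    value, hit = mem.lookup(a, b)
    return value if hit else a * b
```

With a threshold of 0 the memory behaves like an exact cache, while raising the threshold trades accuracy for more hits: a query of (13, 10) against a stored (12, 10) differs in one bit and would return the stored product 120 instead of 130.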
Approximate computing solutions provide up to 10x improvements in both speed and energy-delay product for various applications, including image processing, CUDA benchmarks, and several machine learning algorithms, while incurring acceptable errors. Further, as the maximum allowable error per operation is increased, the performance also increases. For some applications, such as k-means, 91% of the operations can be approximated using our solution, resulting in only 1.1% classification error compared with a run on exact hardware only. Hardware implementation and high computation energy cost are the main bottlenecks of machine learning algorithms in the big data domain. We search for alternative architectures to address the computing cost and data movement issues of traditional cores.

The Internet of Things, Smart Cities, and Wireless Healthcare

In an increasingly informed world, generating and processing information encompasses several computing domains from embedded systems in smart appliances to datacenters powering the cloud. We have worked on efficient distributed data collection and aggregation for processing the data in a hierarchical, context-focused manner. By using hierarchical processing, systems can distill relevant information, increase privacy, and optimize communication energy for Smart Cities, Data Centers, and distributed Smart Grid and Healthcare applications.

Calibration Models for Environmental Monitoring

Sensor nodes at the edge of the Internet of Things often require sensor-specific calibration functions that relate input features to a phenomenon of interest. For example, in air quality sensing, the calibration function transforms input data from onboard sensors into target pollutant concentrations, and in application power prediction, internal performance metrics can be used to predict device power. Edge devices are typically resource constrained, meaning that traditional machine learning models are difficult to fit into the available storage and that on-device training can strain the available processing capabilities. We seek novel methods of reducing the complexity of training machine learning models on the edge by efficiently reducing training datasets, focusing calibration efforts on important regions using application-specific loss functions, and improving regression methods for resource-constrained devices.
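One simple way to focus calibration effort on an important region is to weight the loss function, as in the sketch below. It fits a linear calibration function by weighted least squares in closed form, which is cheap enough for a constrained device. The linear model, the weighting scheme, the toy data, and the function names are all illustrative assumptions; the text does not specify the group's regression methods or loss functions.

```python
def fit_weighted_line(xs, ys, weights):
    """Weighted least-squares fit of y = a * x + b; returns (a, b)."""
    total_w = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total_w
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total_w
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


def calibrate(raw, coeffs):
    """Apply a fitted calibration function to a raw sensor reading."""
    a, b = coeffs
    return a * raw + b


# Toy data: raw sensor readings vs. reference values; the sensor response
# departs from a straight line at the high end of the range.
raw = [0.0, 1.0, 2.0, 3.0, 4.0]
ref = [0.0, 1.0, 2.0, 3.0, 8.0]

uniform = fit_weighted_line(raw, ref, [1, 1, 1, 1, 1])
focused = fit_weighted_line(raw, ref, [1, 1, 1, 1, 10])  # emphasize high readings
```

The focused fit accepts a larger error at low readings in exchange for a much smaller error at the high end, which is the kind of trade-off an application-specific loss makes explicit (for example, when accuracy near a pollutant limit matters most).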

The Internet of Things with Applications to Smart Grid and Green Energy

The emergence of the Internet of Things has resulted in an abundance of data that can help researchers better understand their surroundings and create effective, automated actuation solutions. Our research efforts on this topic target several problems: (1) renewable energy integration and smart grid pricing in large-scale systems, (2) individual load energy reduction and automation, and (3) improved prediction mechanisms for context-aware energy management that leverage user activity modeling.
We have designed and implemented multiple tools that span from individual device predictors to a comprehensive representation of this vast environment.

Wireless Healthcare

With the proliferation of personal mobile computing via mobile phones and the advent of cheap, small sensors, we propose that a new kind of "citizen infrastructure" can be made pervasive at low cost and high value. Though challenges abound in mobile power management, data security, privacy, inference with commodity sensors, and "polite" user notification, the overriding challenge lies in integrating the parts into a seamless yet modular whole that can make the most of each piece of the solution at every point in time through dynamic adaptation. Using existing integration methodologies would cause components to hide essential information from each other, limiting optimization possibilities. Emphasizing seamlessness and information sharing, on the other hand, would result in a monolithic solution that could not be modularly configured, adapted, maintained, or upgraded.

IoT System Characterization and Management: from Data Centers to Smart Devices and Sensors

The Internet of Things is a growing network of heterogeneous devices, combining commercial, industrial, residential, and cloud-fog computing domains. These devices range from low-power sensors with limited capabilities to multi-core platforms at the high end. IoT systems create both new opportunities and new challenges in several different domains. The abundance of data helps researchers better understand their surroundings and create automated solutions that effectively model and manage diverse constrained resources in IoT devices and networks, including power, performance, thermal behavior, reliability, and variability. SEELab's research efforts on this topic address problems including renewable energy integration in large-scale systems, individual load energy reduction and automation, energy storage, context-aware energy management for smart devices, user activity modeling, smart grid pricing, and load integration. To solve these problems, we design and implement multiple tools that not only model and analyze smaller individual pieces but also create a comprehensive representation of this vast environment.


Long-term research requiring high-resolution sensor data needs platforms large enough to house solar panels and batteries. Leveraging a well-defined sensor appliance created using Sensor-Rocks, we develop novel context-aware power management algorithms to maximize network lifetime and provide unprecedented capability on miniaturized platforms.

Energy Efficient Routing and Scheduling For Ad-Hoc Wireless Networks

In large-scale ad hoc wireless networks, data delivery is complicated by the lack of network infrastructure and by limited energy resources. We propose a novel scheduling and routing strategy for ad hoc wireless networks that achieves up to 60% power savings while delivering data efficiently. We test our ideas on HPWREN, a heterogeneous wireless sensor network deployed in Southern California.


SHiMmer is a wireless platform that combines active sensing and localized processing with energy harvesting to provide long-lived structural health monitoring. Unlike other sensor networks that periodically monitor a structure and route information to a base station, our device acquires data and processes it locally before communicating with an external device, such as a remote-controlled helicopter.

Event-driven Power Management

Power management (PM) algorithms aim to reduce energy consumption at the system level by selectively placing components into low-power states. Initially, two classes of heuristic algorithms were proposed for power management: timeout and predictive. Later, a category of algorithms based on stochastic control was proposed. These algorithms guarantee optimal results as long as the system being power managed can be modeled well with exponential distributions. Another advantage is that they can meet performance constraints, something that is not possible with heuristics. We show that there is a large mismatch between measurements and simulation results if the exponential distribution is used to model all user request arrivals. We develop two new approaches that better model system behavior for general user request distributions. These approaches are event driven and give optimal results verified by measurements. The first approach is based on renewal theory. This model assumes that the decision to transition to a low-power state can be made in only one state. The second approach is based on the Time-Indexed Semi-Markov Decision Process (TISMDP) model. This model allows for transitions into low-power states from any state, but it is also more complex than the renewal approach. The results obtained with the renewal model are guaranteed to match the results obtained with the TISMDP model, as both approaches give globally optimal solutions. We implemented our power management algorithms on two different classes of devices, and the measurement results show power savings ranging from a factor of 1.7 up to 5.0 with insignificant variation in performance.
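As a point of reference for the heuristic class mentioned above, the sketch below evaluates a fixed-timeout policy over a list of idle periods and compares it with an always-on baseline and with an oracle that knows each idle period's length in advance. It does not implement the renewal or TISMDP policies. The two-state device model, the power and transition-energy values used in the example, and the function names are all assumptions, and transition latency is ignored.

```python
def break_even_time(p_on, p_sleep, e_transition):
    """Shortest idle period for which sleeping saves energy (latency ignored)."""
    return e_transition / (p_on - p_sleep)


def idle_energy_timeout(idle, timeout, p_on, p_sleep, e_transition):
    """Energy spent in one idle period under a fixed-timeout policy."""
    if idle <= timeout:
        return p_on * idle  # the request arrives before the timeout expires
    return p_on * timeout + e_transition + p_sleep * (idle - timeout)


def total_energy(idle_periods, timeout, p_on, p_sleep, e_transition):
    """Total idle energy under a fixed timeout; use float('inf') for always-on."""
    return sum(
        idle_energy_timeout(t, timeout, p_on, p_sleep, e_transition)
        for t in idle_periods
    )


def oracle_energy(idle_periods, p_on, p_sleep, e_transition):
    """Lower bound: sleep immediately, but only if the idle period is long enough."""
    t_be = break_even_time(p_on, p_sleep, e_transition)
    return sum(
        e_transition + p_sleep * t if t > t_be else p_on * t
        for t in idle_periods
    )
```

For example, with an assumed 1.0 W idle power, 0.1 W sleep power, 0.9 J transition energy, and idle periods of 0.5 s, 2 s, and 10 s, the always-on baseline spends 12.5 J, a 1 s timeout spends 5.3 J, and the oracle spends 3.5 J. The gap between the timeout policy and the oracle is the energy wasted while waiting for the timeout to expire, which is what policies that model the request distribution try to recover.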

Energy-efficient software design

Time to market of embedded software has become a crucial issue. As a result, embedded software designers often use libraries that have been pre-optimized for a given processor to achieve higher code quality. Unfortunately, current software design methodology often leaves high-level arithmetic optimizations and the use of complex library elements up to the designers' ingenuity. We present a tool flow and a methodology that automate the use of complex processor instructions and pre-optimized software library routines using symbolic algebraic techniques. The flow leverages our profiler, which relates energy consumption to the source code and allows designers to quickly obtain a breakdown of energy consumption by procedure.

Energy-efficient wireless communication

Today’s wireless networks are highly heterogeneous, with diverse range and QoS requirements. Since battery lifetime is limited, power management of the communication interfaces without any significant degradation in performance has become essential. We present a set of approaches that efficiently reduce power consumption under different environments and applications. When multiple wireless network interfaces (WNICs) are available, we propose a policy that decides which WNIC to employ for a given application and how to optimize its usage, leading to a large improvement in power savings. In the case of client-server multimedia applications running on wireless portable devices, we can exploit the server's knowledge of the workload. We present a client-side and a server-side power manager that, by exchanging power control information, can achieve more than 67% power savings with no performance loss. Wireless communication is also a critical aspect in the design of specific applications such as distributed speech recognition on portable devices. We consider quality-of-service tradeoffs and overall system latency and present a wireless LAN scheduling algorithm that minimizes the energy consumption of a distributed speech recognition front end.