This brief paper reports on a study of the feasibility of 8-bit precision in analog, in-memory computation of Multiply-Accumulate (MAC) sums, as proposed for implementation of efficient AI/ML hardware. With conventional read-out means and plausible memory array parasitic assumptions, cross-talk renders the 8-bit goal unattainable. In contrast, with the addition of spatial filter circuitry, results show MAC sum read-out can be made insensitive to parasitic cross-talk.
2. Tagmatech LLC Brief: Cross-Talk Control with Analog In-Memory-Compute for AI Applications Page 2 of 6
0.5 nA, per Mahmoodi and Strukov [3], and 16 word-lines are
considered active, resulting in a possible ISUM current range
from 8 nA to 256 times that (~2 µA). In simulation,
arrayed ISUM current values are assigned according to a
pseudo-random sequence generated by the open-source AWK
utility available for Unix and other operating systems. All
simulation cases used the same pseudo-random sequence. As
in [7], the array was modeled as 32 bit-lines that wrap around
at the boundaries. That models bit-lines far away from any
terminations such as ground metal. In a practical design,
physical terminations and discontinuities would be modeled
and accounted for in any spatial filter calculation, but here,
proof-of-concept was the primary objective.
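The stimulus set-up described above can be sketched in a few lines of Python. This is an illustration only, not the AWK script used in the study. It assumes that a digital code k in 0..255 maps to a sink current of (k + 1) x 8 nA, which is one reading of the 8 nA to ~2 µA range quoted above. The seed value and all function names are hypothetical.

```python
import random

N_BITLINES = 32          # array width modeled in the study
UNIT_CURRENT_NA = 0.5    # per-cell unit current in nA, per [3]
ACTIVE_WORDLINES = 16    # word-lines active at once
STEP_NA = UNIT_CURRENT_NA * ACTIVE_WORDLINES   # 8 nA per step

def isum_na(code):
    """ASSUMPTION: codes 0..255 map linearly onto 8 nA .. 2048 nA."""
    if not 0 <= code <= 255:
        raise ValueError("code must be an 8-bit value")
    return (code + 1) * STEP_NA

def random_codes(seed, n=N_BITLINES):
    """One pass of pseudo-random 8-bit states; same seed, same sequence."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(n)]

def neighbors(i, n=N_BITLINES):
    """Nearest neighbors of bit-line i, wrapping around at the boundaries."""
    return ((i - 1) % n, (i + 1) % n)

codes = random_codes(seed=2019)           # hypothetical seed
currents_na = [isum_na(c) for c in codes]
```

Reusing one seed for every simulation case mirrors the study's use of a single pseudo-random sequence, so that differences between cases come from the circuit and not from the data.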
Note that, for this study, all bit-lines are active in every
cycle. That affords the maximum possible information
bandwidth from the array, without wasting energy or space
for active damping or shielding, while avoiding decode and
physical layout complexities associated with those strategies.
Effective spatial filtering makes shielding superfluous and
even counter-productive.
III. READ-OUT TIMING SEQUENCE
In operation, the Read-Out timing sequence of the MAC
array resembles that of the ABLV mode in [7]. Initially, all
bit-lines are reset to a fixed, regulated voltage level. In this
case, that level is one volt (conveniently, the nominal operating
voltage for the PTM transistor models used here). Then
word-lines go active, causing the ISUM currents
to flow. Bit-line levels are allowed to fall for a predetermined
period of time, with each falling at a rate determined by its
associated ISUM value and parasitic coupling effects. For this
study, the time period is that required to discharge all bit-lines
to 0.1 volt when all ISUMs are set to their maximum possible
value (that’s the value representing an 8-bit digital 255). At
the end of that period, bit-line levels are captured and the
charge held in the Read-Out circuit while it does its work.
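The predetermined discharge period follows from charge conservation, T = C x (Vreset - Vfloor) / Imax. The Python sketch below works through that arithmetic for an idealized bit-line with a constant-current sink and no coupling to neighbors. The 1 pF total bit-line capacitance is a placeholder assumption, not a value taken from this study, and the function names are hypothetical.

```python
V_RESET = 1.0        # volts, regulated reset level
V_FLOOR = 0.1        # volts, target level when all ISUMs are at maximum
I_MAX = 2.048e-6     # amps, 256 x 8 nA maximum ISUM
C_BITLINE = 1.0e-12  # farads, ASSUMED placeholder, not from the study

def discharge_period(c=C_BITLINE, i_max=I_MAX):
    """Time for the maximum ISUM to pull a bit-line from V_RESET to V_FLOOR."""
    return c * (V_RESET - V_FLOOR) / i_max

def level_after_period(i_sum, c=C_BITLINE):
    """Uncoupled bit-line level captured at the end of the period."""
    t = discharge_period(c)
    return V_RESET - i_sum * t / c
```

In this uncoupled idealization the captured level depends only on the ratio of the ISUM to its maximum, so the capacitance cancels out. Cross-talk is what breaks that simple relationship in the coupled array.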
Whether simulating the no-filter case or a spatial-filter case,
the Read-Out circuit acts as a comparator that tests bit-line
levels against an OFFSET voltage input. The applied OFFSET
voltage is calibrated to map to 8-bit digital ISUM values. For
example, ideally, when an ISUM value in the array is set to
represent a digital value of 123, the corresponding
comparator’s logic output goes high if the applied OFFSET
input is equal to or less than that representing 123. In the no-
filter case, the behavior is that of a simple comparator. In the
spatial-filter cases, the comparison also includes analog
arithmetic that implements the spatial filter.
Fig 2 shows the timing example of BL(14) and its two
nearest neighbors for both the Short Bit-Line and Long
Bit-Line cases. In this example, the ISUM of BL(14) codes a
digital value of 4, while ISUMs at BL(13) and BL(15) code
digital values of 45 and 167 respectively. The applied
OFFSET was that mapped to the digital SUM of 4. Despite the
cross-talk-coupled influence of neighbors having higher
digital values and, therefore, stronger ISUM currents, the
OUT(14) logic level transitioned correctly. Note that the Long
and Short Bit-Line cases differ in OUT delay due to
correspondingly different filter parameters and array
dynamics, but trip at the same, correct digital value.
Fig 3 shows the mapping of applied OFFSET to 8-bit digital
SUM values for Analog-to-Digital conversion, after an
empirical calibration adjustment. Ideally, given the
predetermined operating range of bit-line voltage, digital
Nov 24, 2019
Fig 1: (a) Bit-Line Physical Model, (b) MAC Memory Array
Electrical Concept
Fig 2: Read-Out Timing Example, Short or Long Bit-Lines,
OUT(14) shown Tripping Correctly with a small SUM,
independent of larger SUMs at adjacent Bit-Lines
values from 0 to 255 would be mapped linearly between 1.0
and 0.1 volts of OFFSET. However, for ultimate precision,
calibration is required. That is because, in the digital filter
design calculation, OFFSET collects the error from the
engineering approximation that minimizes filter terms (see [7]
for elaboration). OFFSET calibration can also correct for
subtle biases in practical circuitry, such as net charge injection
by transistors in switched-capacitor circuits. However, for this
study, as shown in Fig 3, the calibrated mapping actually
deviated very little from the ideal.
IV. CIRCUIT IMPLEMENTATION
In its idealized form shown in Fig 4(a), the filter circuit for
this study is identical to the example of the previous paper [7].
The CMOS implementation in Fig 4(b) is also similar, but
includes additional gain to accommodate the higher precision
required in the MAC application. Whereas the gain
requirement of the single-level Read-Out example could be
met with a single cascode stage, the MAC array here uses two
in series, with each configured to provide the square-root of
the required total gain. Cascode-style gain elements are
advantageous in this application, due to the relative ease of
physically fitting such a structure into the bit-line pitch of a
memory array.
Fig 5 shows an additional detail of the CMOS circuit
implementation. Whereas the single-level Read-Out example
[7] performed adequately with textbook-style CMOS transfer
gates for sample/weight capacitor switches, the higher
precision of the MAC application requires more uniform
conductance and parasitic capacitance during the bit-line
discharge phase of the Read-Out operation. That need was met
by using n-channel transistors operating with a constant Vgs
during the critical period. For the low-side (pull-down)
devices, only logic level gate drive was required. To switch on
the high-side devices, the gate drive needed to be boosted
above the source and remain at substantially the same bias
relative to the source during the bit-line discharge. Typically,
such a requirement is met by employing a dynamic circuit
technique, charging the gate with an initial pulse and allowing
parasitic gate capacitance to hold the gate-source bias voltage
thereafter. For this study, the convenience of a voltage-
controlled-voltage source was used to simulate dynamic high-
side gate driver circuitry.
V. DETERMINING SPATIAL FILTER PARAMETERS
Though tedious, it can be instructive to determine optimal
filter parameters by empirical means. Given an array model
whose coupling parasitics act predominantly between nearest
neighbor bit-lines, one could reasonably anticipate using a
filter topology having only three bit-line inputs, as in Fig 4(a).
Fig 3: Analog-to-Digital Conversion by Mapping Applied
OFFSET Voltage to Digital SUM (Shows Mapping after
Calibration)
Fig 4: (a), Idealized Spatial Filter Circuit for Read-Out,
(b), Practical Circuit in CMOS
Fig 5: Capacitor Switching Detail, Constant Vgs
Starting from there, an empirical filter design process might
go as follows. (1) Choose a reasonably acceptable total
capacitance that the filter circuit may present to the bit-lines.
In the present study, that is 0.5 pF. (2) Given any
arbitrary, random distribution of stored states in the memory
array, experimentally adjust the allocation of that capacitance
among C1, C2 and C3, while keeping C1 equal to C3. Choose
the apportionment that minimizes error induced by
neighboring bit-line signals. (3) Adjust the gain by tuning the
value of C5 to obtain an OUT signal amplitude equal to a
logic signal swing. In the present case, that is one volt. (4)
Adjust the value of C4 such that the OFFSET signal level is
equal to the logic high level when looking at a bit-line whose
stored state represents a digital zero.
A more general and labor-efficient, but also more abstract,
method is outlined in [7]. Given the spatial impulse response
of the array as input to the calculation, it uses Discrete Fourier
Transforms to calculate the coefficients of an inverse spatial
filter while minimizing bit-line terms and allocating the
residuals to the OFFSET term. There need be no initial
presumption about the physical reach of coupling effects; the
procedure shows how many bit-line filter terms are significant.
Though nearest neighbors are often the dominant cross-talk
contributors, coupling can extend further, in principle. For this
study, the more abstract approach was followed. As in the
referenced prior example, Octave [14] was used to do the
analysis, including the arithmetic adjustment of C4 to obtain
the desired operating voltage range for the OFFSET input.
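The core of the DFT-based calculation can be illustrated with a short Python sketch. It is not the Octave analysis used for this study and does not reproduce the term-minimizing approximation or the allocation of residuals to OFFSET described in [7]. It shows only how an inverse spatial filter follows from a wrap-around impulse response. The 10% nearest-neighbor coupling figure and all function names are assumptions chosen for illustration.

```python
import cmath

N = 32  # bit-lines in the wrap-around array model

def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform, adequate for N = 32."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out

def inverse_filter(h):
    """Coefficients g whose circular convolution with h is a unit impulse."""
    spectrum = dft(h)
    inverted = [1.0 / hk for hk in spectrum]   # requires no spectral zeros
    return [v.real for v in dft(inverted, inverse=True)]

def circular_convolve(a, b):
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]

# ASSUMED impulse response: 10% of the signal couples to each nearest neighbor
coupling = 0.10
h = [0.0] * N
h[0], h[1], h[-1] = 1.0 - 2 * coupling, coupling, coupling

g = inverse_filter(h)

# Bit-line offsets (relative to the center line) whose coefficients matter
significant = [(k if k <= N // 2 else k - N, round(c, 4))
               for k, c in enumerate(g) if abs(c) > 1e-3]
```

Inspecting `significant` shows how quickly the inverse-filter coefficients decay with distance from the center bit-line, which is how such a calculation indicates the number of bit-line terms worth implementing.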
In this case, the input spatial impulse response reflects the
very small data state contrasts and varying levels that the
Read-Out circuit is required to handle. See the examples in Fig
6 showing the bit-line levels after being discharged for the
predetermined time period, given a digital zero state at BL(16)
and digital one states on the remainder of the bit-lines. When
run against that input, the Octave calculation produces a set
of values for C1 through C7 along with the OFFSET level
suitable for Read-Out of a digital zero. A second spatial
impulse response was then run, this time having a stored state
pattern with a digital 254 at BL(16) and a digital 255 on the
remainder. The shape of the impulse response was the same,
but the level was lower, bottoming out near the 0.1 volt target
for a solid 255 pattern. The resulting capacitor values were the
same as from the first run, but the OFFSET level was lower,
representing that required to detect a digital 254. Given those
two OFFSET data points and the assumption of system
linearity, a straight-line mapping of OFFSET to stored digital
state was determined. As previously explained, further small
empirical adjustments were made with respect to OFFSET
slope and zero-intercept values, consistent with anticipated
effects of the simplifying approximation made at filter design-
time, along with any subtle circuit non-idealities.
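The two-point, straight-line mapping of OFFSET to stored digital state amounts to the arithmetic sketched below in Python. The calibration voltages used here are placeholders derived from the ideal 1.0 V to 0.1 V range, not the values obtained in the study, and the function names are hypothetical. The small empirical slope and intercept adjustments mentioned above are not modeled.

```python
def fit_offset_line(code_a, offset_a, code_b, offset_b):
    """Return (slope, intercept) of OFFSET in volts versus digital code."""
    slope = (offset_b - offset_a) / (code_b - code_a)
    intercept = offset_a - slope * code_a
    return slope, intercept

def offset_for_code(code, slope, intercept):
    """OFFSET voltage that represents a given 8-bit code."""
    return slope * code + intercept

def code_for_offset(offset, slope, intercept):
    """Nearest 8-bit code for a given OFFSET voltage, clamped to 0..255."""
    code = round((offset - intercept) / slope)
    return max(0, min(255, code))

# ASSUMED calibration points at codes 0 and 254, taken from the ideal line
OFFSET_AT_0 = 1.0
OFFSET_AT_254 = 1.0 - 0.9 * 254 / 255
slope, intercept = fit_offset_line(0, OFFSET_AT_0, 254, OFFSET_AT_254)
```

With measured OFFSET values substituted for the two placeholders, the same two functions give the forward and reverse mappings used for Analog-to-Digital conversion.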
Filter capacitor values are summarized in Table 1. Note the
sum of C1, C2 and C3 is 0.5 pF in all three cases. C4 varies as
needed for a maximum OFFSET level of one volt. C5, C6 and
C7 collectively implement the required filter gain, with each
cascode stack providing the square root of the total. To
simulate a precise, technology-agnostic comparator in the
“No-Filter” case, the circuit of Fig 4(a) was implemented with
ideal switches and a voltage-dependent-voltage source for
gain, with C1 and C3 set to zero.
Table 1: Calculated Spatial Filter Capacitor Values (fF),
compared to No-Filter

             C1    C2    C3    C4     C5    C6    C7
No Filter     -   500     -   500   1.76     -     -
Short BL     46   408    46   314   7.15   150  23.2
Long BL     106   288   106  75.6   5.32   150  7.54
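As a quick consistency check on Table 1, the Python sketch below transcribes the tabulated values and confirms that the three input capacitors sum to 500 fF in every case. It also reports the fraction of that input capacitance assigned to the two neighbor inputs, a derived figure that does not appear in the source. The helper names are hypothetical.

```python
# Capacitor values in fF, transcribed from Table 1 (None marks a "-" entry)
TABLE_1 = {
    "No Filter": {"C1": None, "C2": 500, "C3": None, "C4": 500,
                  "C5": 1.76, "C6": None, "C7": None},
    "Short BL":  {"C1": 46, "C2": 408, "C3": 46, "C4": 314,
                  "C5": 7.15, "C6": 150, "C7": 23.2},
    "Long BL":   {"C1": 106, "C2": 288, "C3": 106, "C4": 75.6,
                  "C5": 5.32, "C6": 150, "C7": 7.54},
}

def input_load_fF(row):
    """Total capacitance presented by the C1, C2 and C3 inputs."""
    return sum(row[name] or 0 for name in ("C1", "C2", "C3"))

def neighbor_share(row):
    """Fraction of the input capacitance allotted to the neighbor inputs."""
    return ((row["C1"] or 0) + (row["C3"] or 0)) / input_load_fF(row)

loads = {case: input_load_fF(row) for case, row in TABLE_1.items()}
shares = {case: neighbor_share(row) for case, row in TABLE_1.items()}
```

The neighbor share grows from the short to the long bit-line case, consistent with the stronger coupling that the longer bit-lines present.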
VI. FULL ARRAY SIMULATION WITH RANDOMLY DISTRIBUTED DATA STATES
Fig 7 displays the summary results of the Read-Out
simulations with and without the spatial filter, with both short
and long bit-lines, with randomly distributed data states
assigned to the array’s ISUM current sinks. The “SUM Read-
Out” plots show the raw result of the Analog-to-Digital
conversion. The “Summing Error” plots show the difference
between that raw result and the intended result. To more
densely populate the scatter plots and thereby provide a clearer
visual impression than is possible with a single pass through
the 32-bit-line array model, four passes were done. A different
random ISUM distribution was assigned at each pass. Each
plot condition, therefore, has 128 points displayed. That
provides a good visual impression of the statistics of scatter
and skew.
What is apparent in the “No-Filter...” cases is that scatter and
skew both increase as coupling ratio (associated with bit-line
length here) increases. Scatter is attributable to cross-talk from
neighboring data values that can be above or below the
intended value. The effect is a loss of resolution. Skew is
attributable to the tendency of extreme values to be pulled
toward the mean by being coupled to less extreme neighboring
values. That may be thought of as a compression of usable
dynamic range. The short bit-line case exhibits enough scatter
to have lost the equivalent of several bits of digital resolution,
Fig 6: Array Spatial Impulse Response Examples, Inputs for
Inverse Filter Calculation
falling well short of the 8-bit goal. The short bit-lines also
appear to have suffered a loss of about a quarter of their
dynamic range. The long bit-line case is clearly much worse in
both ways. From the plots, it is not difficult to imagine the
limiting case of very long bit-lines where the dynamic range is
nil and the Read-Out is only random scatter containing no
usable information. At that point, there would be no way an AI
system could code and recover meaningful data.
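The distinction between scatter and skew can be reproduced with a toy model in Python. The sketch below is not the circuit simulation behind Fig 7. It assumes a simple linear mixing in which each bit-line keeps a fraction (1 - 2a) of its own value and picks up a fraction a from each wrap-around neighbor. The coupling fraction a, the seed and the function names are all hypothetical.

```python
import random

N = 32  # bit-lines per pass

def toy_readout(codes, a):
    """ASSUMED linear mixing with the two wrap-around nearest neighbors."""
    n = len(codes)
    return [(1 - 2 * a) * codes[i] + a * (codes[i - 1] + codes[(i + 1) % n])
            for i in range(n)]

def error_stats(a, passes=4, seed=1):
    """Return (skew slope, residual scatter) over passes x N points."""
    rng = random.Random(seed)
    xs, errs = [], []
    for _ in range(passes):
        codes = [rng.randrange(256) for _ in range(N)]
        out = toy_readout(codes, a)
        xs.extend(codes)
        errs.extend(o - c for o, c in zip(out, codes))
    mean_x = sum(xs) / len(xs)
    mean_e = sum(errs) / len(errs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxe = sum((x - mean_x) * (e - mean_e) for x, e in zip(xs, errs))
    slope = sxe / sxx   # negative slope: extremes pulled toward the mean
    resid = [e - mean_e - slope * (x - mean_x) for x, e in zip(xs, errs)]
    scatter = (sum(r * r for r in resid) / len(resid)) ** 0.5
    return slope, scatter
```

In this toy model the fitted slope of error versus intended value is negative, corresponding to compression of dynamic range, and the residual scatter grows in proportion to the coupling fraction. Both trends match the qualitative behavior described for the No-Filter cases.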
In contrast, the “Filter...” cases, short and long, appear
flawless at the scale of the Fig 7 plots. Only by inspection of
the raw numerical data was it possible to identify
imperfections consisting of an occasional error of a count or
two, plus or minus. No systematic bias was readily
discernible. Though potential subtle effects of circuit non-
ideality cannot be discounted without closer scrutiny, a
suspicion is that the simulator’s inherent numerical limitations
may have played a role in the missed counts.
VII. CONCLUSIONS
While the quantitative realism of this study’s coupled array
model may be debated in terms of parasitic capacitance
estimation and the details of the hypothetical interconnect
technology, there are important conclusions that can be drawn
from the results. One is that modest levels of array cross-talk
prevent attainment of 8-bit equivalent resolution or anything
close to it. Another is that cross-talk mitigation by inverse
spatial filtering can be very effective, even in the presence of
strong cross-talk.
A further conclusion is that the proposed spatial filter can be
implemented in low voltage CMOS technology using
switched-capacitor (SC), voltage-mode Read-Out methods. SC
brings transistor-parameter independence and low-voltage
compatibility that are not as easily attained in CMOS with
current-mode methods.
Capacitive parasitics are often the primary cause of array
cross-talk, as in this study. However, the illustrated approach
should be equally effective correcting cross-talk due to
resistive ground or return current paths shared by proximate
memory cells, as in NOR-type architectures and others. In
fact, error due to combinations of various possible linear
coupling mechanisms should be simultaneously correctable
without necessarily complicating the spatial filter circuit
topology. With such utility, the approach should be effective
in adapting a wide variety of memory technologies for in-
memory MAC applications.
Fig 7: Read-Out of Randomly Distributed SUMs, With and Without the Spatial Filter, with Differing Bit-Line Lengths (and,
therefore, Differing Coupling Ratios and Cross-Talk Strengths)
References
[1] L. Fick, D. Blaauw, D. Sylvester, S. Skrzyniarz, M. Parikh and D. Fick,
“Analog In-Memory Subthreshold Deep Neural Network Accelerator,” 2017
IEEE Custom Integrated Circuits Conference
[2] Mike Demler, The Linley Group, “Mythic Multiplies in a Flash,”
Microprocessor Report, August 27, 2018,
https://www.mythic-ai.com/wp-content/uploads/2018/08/Mythic-Multiplies-In-A-Flash.pdf,
Downloaded 11-November-2019
[3] M. Reza Mahmoodi, Dimitri Strukov, “An Ultra-Low Energy Internally
Analog, Externally Digital Vector-Matrix Multiplier Based on NOR Flash
Memory Technology,” DAC 2019,
https://www.ece.ucsb.edu/~strukov/papers/2018/dac2018.pdf, Downloaded
11-November-2019
[4] F. Merrikh Bayat, X. Guo, H. A. Om’mani, N. Do, K.K. Likharev, D.B.
Strukov, “Redesigning Commercial Floating-Gate Memory for Analog
Computing Applications,” 2015 IEEE International Symposium on Circuits
and Systems,
https://www.ece.ucsb.edu/~strukov/papers/2015/ISCASflash2015.pdf,
Downloaded 11-November-2019
[5] Jeremy Holleman, “Flash-Based Analog Neural Networks: Possibilities
and Trade-Offs,” Flash Memory Summit 2019, Santa Clara California,
https://www.flashmemorysummit.com/Proceedings2019/08-08-Thursday/20190808_AIML-302-1_Holleman.pdf,
Downloaded 11-November-2019
[6] Mark Reiten, “Analog In-Memory Compute using SST SuperFlash,”
Flash Memory Summit 2019, Santa Clara California,
https://www.flashmemorysummit.com/Proceedings2019/08-08-Thursday/20190808_AIML-302-1_Reiten.pdf,
Downloaded 11-November-2019
[7] Bruce L. Morton, “Mitigation of Cross-Talk in Memory Arrays,”
Tagmatech LLC White Paper, September 30, 2013,
https://www.slideshare.net/BruceMorton8/tagmatechcrosstalkpaper-66928548,
Downloaded 11-November-2019
[8] B. Morton, “Memory Device and Method Thereof,” United States Patent
8,189,410, May 29, 2012
[9] B. Morton, “Memory Device and Method Thereof,” United States Patent
8,339,873, December 25, 2012
[10] B. Morton, “Memory Device and Method Thereof,” United States Patent
9,099,169, August 4, 2015
[11] B. Morton, “Memory Device and Method Thereof,” United States Patent
9,530,463, December 27, 2016
[12] Predictive Technology Model, Nanoscale Integration and Modeling
Group, ASU, http://ptm.asu.edu/
[13] S.-C. Wong, G.-Y. Lee, D.-J. Ma, “Modeling of Interconnect
Capacitance, Delay, and Crosstalk in VLSI,” IEEE Transactions on
Semiconductor Manufacturing, vol. 13, no. 1, pp. 108-111, February 2000
[14] GNU Octave, Scientific Programming Language,
https://www.gnu.org/software/octave/, Downloaded 11-November-2019
Bruce L. Morton received the B.S. in electrical engineering
from Oklahoma State University in Stillwater, Oklahoma in
1975. With support of an Engineering Foundation Fellowship,
he earned an M.S. in electrical engineering from the
University of Texas at Austin, Texas in 1976.
After graduation, he joined Motorola in Austin where he
learned the basics of early MOS technology and memory
design, ultimately contributing to the design of DRAM,
SRAM and Non-Volatile memory for commodity markets and
proprietary System-on-Chip products. In 2005, he joined
AMD/Spansion, working primarily on modeling of
developmental memory devices, circuits and architectures.
Being semi-retired since 2009, he has worked independently
on new design ideas, while also consulting on an occasional
basis on the topics of circuit design, devices and technology.
In 2012, he formed Tagmatech LLC to continue to develop
and license new IP.