A Smart Front-end Real-Time Detection and Tracking
Lih-Guong Jang (張立光)¹, Jhih-Guo Peng (彭智國)²
Identification and Security Technology Center,
Industrial Technology Research Institute, Taiwan, ROC
¹ E-mail: lihguong@itri.org.tw
² E-mail: jhihguonpeng@itri.org.tw
Abstract- When a security event occurs, most conventional
surveillance systems cannot meet the requirement for
real-time security analysis; they usually record huge amounts
of video data on a backend system, and considerable manpower
is then spent searching for the event pictures. In this paper
we share our experience implementing front-end embedded video
surveillance on the TI DM6467 for real-time security
detection and identification. Based on the TI DM6467, we
develop an intelligent front-end embedded surveillance device
called "S-Box", which performs video analytics, signal
sensing, data fusion and WEB streaming.
Keywords- Embedded surveillance system, dynamic textures,
target tracking, spatial-temporal, background
I. INTRODUCTION
Video surveillance systems can be configured in
centralized or distributed architecture, or some combination
thereof. The design of the appropriate architecture for each
particular situation should balance the needs for (1) central
control and override capability, (2) robust, failure-resistant
operation, (3) autonomous, degraded modes of operation, (4)
peak versus average throughput, (5) expansion requirements,
(6) resistance to compromise, etc.
This paper presents a smart front-end solution called
“S-Box”: a configurable embedded intelligence platform for
real-time video analysis, signal sensing, data fusion, and
monitoring. In S-Box, a novel data fusion algorithm is
proposed to fuse data from various kinds of sensors,
including visual sensors, RFID and other sensor types. We
integrate the information from sensor-based and vision-based
surveillance systems [1] and perform data fusion to
construct the “security metadata” used for real-time
security analytics. Furthermore, multiple S-Boxes can form a
surveillance network in which a coordination scheme for the
networked S-Boxes tracks a designated target over a large
open space. All of the above-mentioned technologies will be
implemented on a new embedded system that provides high
computation power and module integration capability.
A. S-Box Hardware Design Features
S-Box is designed as a distributed embedded computing
platform featuring (1) multi-thread processing for embedded
intelligence, (2) multi-modal data fusion for a multi-sensor
platform, and (3) system optimization for embedded
multimedia signal processing.
S-Box consists of the TI DM6467 multi-core processing
unit and its peripherals, as shown in Figure 1. The
peripherals include (1) the communication unit, which
comprises Power over Ethernet and a Small Form-factor
Pluggable interface, and (2) the sensing unit, which
comprises two video input ports, one video output port, an
audio input port, an audio output port, two discrete input
ports, two discrete output ports, and a relay output port.
Figure 1. The prototype of the S-BOX.
For the first aspect, the Intelligent Video Analysis (IVA)
function, two video signals at D1 resolution and 30 frames
per second per channel are fed to the processing unit for
real-time video analysis and compression.
The video analysis results from the TI DM6467 DSP core
are sent to the TI DM6467 ARM core [4][5] for condition
judgment; the final results are packaged as XML metadata and
sent to the remote backend system. In addition, the video
analysis results are compressed into H.264 or MJPEG format
by the TI DM6467 embedded video compression engine. The
compressed video stream is transmitted over RTSP or
RTP/RTCP to the remote backend system or Network Video
Server.
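As a rough illustration of the metadata path, the sketch below
serializes one detection result to XML with the Python standard
library. The element and attribute names (event, object, bbox)
and the sample values are hypothetical; the S-Box metadata
schema is not specified in this paper.

```python
import xml.etree.ElementTree as ET

def build_metadata(channel, frame_no, objects):
    """Serialize detection results to an XML string (hypothetical schema)."""
    root = ET.Element("event", channel=str(channel), frame=str(frame_no))
    for obj in objects:
        node = ET.SubElement(root, "object", id=str(obj["id"]))
        x, y, w, h = obj["bbox"]
        # Bounding box of the detected object in pixel coordinates.
        ET.SubElement(node, "bbox", x=str(x), y=str(y), w=str(w), h=str(h))
    return ET.tostring(root, encoding="unicode")

# One object detected on channel 0, frame 120 (made-up values).
xml_text = build_metadata(0, 120, [{"id": 1, "bbox": (40, 60, 32, 64)}])
```

A backend could parse such a string back with
`ET.fromstring(xml_text)` and read the attributes directly.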
For the second aspect, the external data fusion function,
the S-Box peripherals receive signals from external sensing
apparatuses such as cameras, audio devices or discrete signal
sources. The disparate sensing signals and the video
analysis results are fused by the novel sensor-centric data
fusion model.
S-Box can compress audio signals into the AAC or G.711
digital audio format and transmit them over RTSP or
RTP/RTCP to the remote backend system or Network Video
Server. S-Box can also receive a remote broadcast audio
signal and output it to an external speaker device. Four
discrete signal inputs provide the system with parameters
for the video recognition analytics. In addition, after the
data fusion results pass through decision support in the
spatial-temporal situational awareness process, the
peripheral controller switches the relay ON or OFF according
to the output result to control the remote alarm device.
B. S-Box IVA Software Design Features
S-Box IVA is a front-end video processing system, based on
the TI DM6467, that implements object detection, tracking
and other video analysis algorithms (Figure 2). A WEB
interface lets users configure the S-Box IVA analysis
parameters and streaming parameters. Its features are:
1) The original video material does not need to be
transmitted to the backend systems for video analysis. The
analysis is performed on the front-end device, which gives
timely results and reduces the network bandwidth
requirement.
2) Video is encoded onsite and streamed to the backend for
playback, providing users with real-time information.
3) Users can control the S-Box IVA analysis result by
setting the IVA parameters (Y, Cr, Cb, LBP, background
update) and can adjust the streaming parameters
(unicast/multicast, stream port, stream type) to suit their
network state through the WEB interface.
Figure 2. S-Box WEB streaming architecture.
The S-Box IVA program was developed with the TI
DM6467 development tools (VISA APIs, DMAI and SDK
APIs) [2][3][4][5]. The software architecture is shown in
Figure 3.
The IVA program includes eight threads: the main thread
(which eventually becomes the control thread), capture
thread, IVA thread, video thread, writer thread, audio decode
thread, stream thread and PTZ thread. Except for the main
(control) thread and the stream thread, all threads are
scheduled with the SCHED_FIFO policy. The order of priority
is as follows: the capture thread has the highest priority,
the video thread the second, the IVA thread and the audio
thread the third, the writer thread and the PTZ thread the
fourth, and the control thread the fifth.
Initialization and cleanup of these threads are synchronized
with the TI DMAI Rendezvous module, which uses POSIX
condition variables to synchronize thread execution. After a
thread completes its initialization, it signals the
Rendezvous object and waits. When all threads have completed
initialization, they are all released together and start
their own main loops. The cleanup process uses the same
mechanism.
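The rendezvous pattern described above can be sketched in a few
lines. This is only an analogy in Python, using
`threading.Barrier` in place of the DMAI Rendezvous module; the
thread names and the `run_threads` helper are illustrative and
not taken from the S-Box source.

```python
import threading

def run_threads(names):
    """Start one worker per name; no worker enters its main loop until
    every worker has finished initialization (the rendezvous)."""
    rendezvous = threading.Barrier(len(names))
    lock = threading.Lock()
    log = []

    def worker(name):
        with lock:
            log.append(("init", name))   # per-thread initialization
        rendezvous.wait()                # block until all threads arrive
        with lock:
            log.append(("loop", name))   # the main loop would start here

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_threads(["capture", "video", "iva", "writer"])
```

Every "init" entry in `log` precedes every "loop" entry, because
no thread passes the barrier before the last one has arrived.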
Figure 3. S-Box software architecture diagram.
The S-Box streaming server is built on "LIVE555" [6],
which supports open streaming standards such as RTP/RTCP,
RTSP and SIP. The streaming server consists of four parts:
"BasicUsageEnvironment" and "UsageEnvironment" handle event
dispatch when an event occurs; "Groupsock" handles the
network sockets and is mainly used for multicast streaming;
"liveMedia" contains the basic media classes and can handle
MPEG4, AAC and H.264.
Figure 4. RTSP communication flow chart.
As shown in Figure 4, we implement the RTSP
communication process on the RTSP server. If the RTSP
server accepts the client connection, it accepts the "PLAY"
command according to the RTSP standard, then begins to
fetch the address and frame size of the compressed video
from the encoded buffer through inter-process communication
and transmits the data by unicast.
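To make the message exchange concrete, the sketch below formats
RTSP/1.0 requests as plain text following the message layout of
RFC 2326. Only the "PLAY" command is named in the text above;
the other methods show a typical client sequence, and the URL,
port numbers and session identifier are hypothetical.

```python
def rtsp_request(method, url, cseq, session=None, extra=None):
    """Format one RTSP/1.0 request (request line, headers, blank line)."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    if session is not None:
        lines.append(f"Session: {session}")
    for key, value in (extra or {}).items():
        lines.append(f"{key}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"

url = "rtsp://192.0.2.10/stream"   # hypothetical S-Box address
handshake = [
    rtsp_request("OPTIONS", url, 1),
    rtsp_request("DESCRIBE", url, 2, extra={"Accept": "application/sdp"}),
    rtsp_request("SETUP", url + "/track1", 3,
                 extra={"Transport": "RTP/AVP;unicast;client_port=5000-5001"}),
    rtsp_request("PLAY", url, 4, session="12345678"),
]
```

The `unicast` token in the Transport header corresponds to the
unicast transmission mentioned above.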
II. FRONT-END SURVEILLANCE SYSTEM
This section describes the key technologies of the S-Box
IVA algorithm development, including background modeling,
foreground detection and texture modeling.

A. Spatial-Temporal Probability Model
In general, target detection cannot be accurate under
varying lighting or against a cluttered background. In
particular, light reflection and backlighting can seriously
degrade target detection. In this section, we propose a
spatial-temporal probability background model to segment
the foreground from the background under varying lighting or
a cluttered background. Furthermore, to detect the
foreground efficiently and robustly, multi-resolution image
processing and model-based background updating are applied.
The intensity variation of each pixel in the temporal domain
is modeled by the SDG models. However, when targets are
detected in a non-stationary or cluttered scene, the pixel
distribution of the background may change. Statistical
information about the texture distribution can mitigate this
background-change problem. A mixture of spatial and temporal
statistical models is therefore proposed to remove the
influence of the non-stationary and cluttered background.
The intensity variation of a pixel is shown in Figure 5-(a),
and the texture statistics of a pixel over its neighboring
pixels are shown in Figure 5-(b). Finally, the
spatial-temporal probability model is defined as:
$p(I \mid B_x) = w_s\, p_s(I_s \mid B_x) + w_t\, p_t(I_t \mid B_x)$, (1)
where $I_t$ and $I_s$ are the intensity values measured along
the temporal axis and among the spatial neighboring pixels,
respectively, and $B_x$ is the pixel distribution of the
background. The weighting factors $w_s$ and $w_t$ sum to one.
Figure 5. (a) Temporal variation of a background pixel. (b) Spatial
variation of a background pixel among the neighboring pixels.
Then, the likelihood ratio using the spatial-temporal
probability model is defined as:
$p(I \mid B_x) \ge \lambda / L$, (2)
where λ is a constant. If $p(I \mid B_x) \ge \lambda / L$, the
pixel belongs to the background; otherwise it belongs to the
foreground.
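A minimal numeric sketch of the mixture in Eq. (1) and the test
in Eq. (2) follows. It assumes, purely for illustration, that
both the temporal and the spatial component are 1-D Gaussians
and that L is the number of gray levels (256); the paper states
neither, and all means, deviations, weights and λ below are
made-up values.

```python
import math

def gaussian_pdf(x, mean, std):
    """1-D Gaussian density; stands in for the temporal and spatial models."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def background_probability(i_t, i_s, temporal, spatial, w_t=0.5, w_s=0.5):
    """Eq. (1): weighted mix of the two likelihoods, with w_s + w_t = 1."""
    assert abs(w_s + w_t - 1.0) < 1e-9
    return w_s * gaussian_pdf(i_s, *spatial) + w_t * gaussian_pdf(i_t, *temporal)

def is_background(p, lam=1.0, levels=256):
    """Eq. (2): background when p(I|B_x) >= lambda / L."""
    return p >= lam / levels

temporal = (120.0, 5.0)   # (mean, std) of the pixel over time, hypothetical
spatial = (118.0, 8.0)    # (mean, std) over the neighborhood, hypothetical
p_near = background_probability(121.0, 119.0, temporal, spatial)
p_far = background_probability(200.0, 190.0, temporal, spatial)
```

With these values, an observation close to both models
(`p_near`) passes the background test, while one far from both
(`p_far`) is labeled foreground.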
In addition, because it is difficult to detect objects whose
intensity distribution is close to the background model, the
likelihood ratios of three color components (RGB or YCrCb)
are fused to overcome this problem. In general, there are two
fusion rules for detecting the foreground: the linear
combination rule and the voting rule.
After comparing the fusion rules, we apply the voting rule
to cope with the illumination variation problem. If a pixel
is classified as background by at least two of the
components’ background models, the pixel is classified as
background; otherwise, it is classified as foreground.
Figure 6 illustrates foreground detection using the voting
rule. The foreground detection using the voting rule
outperforms the one using the linear combination rule.
Hence, we apply the voting rule to detect objects in outdoor
crowd scenes.
Figure 6. Object detection using spatial-temporal probability model.
Linear combination rule:
If $w_y\, p_Y(u_y \mid B_x) + w_{cr}\, p_{Cr}(u_{Cr} \mid B_x) + w_{cb}\, p_{Cb}(u_{Cb} \mid B_x) > T$,
pixel u is classified as background,
otherwise,
pixel u is classified as foreground,
where $w_y$, $w_{cr}$ and $w_{cb}$ are weighting factors whose
sum is equal to one.
Voting rule:
Given $p_Y(u_y \mid B_x)$, $p_{Cr}(u_{Cr} \mid B_x)$, $p_{Cb}(u_{Cb} \mid B_x)$,
If at least two components’ background models classify the
pixel as background,
{ $p_Y(u_y \mid B_x) \ge \lambda/L$ and $p_{Cr}(u_{Cr} \mid B_x) \ge \lambda/L$ } or
{ $p_{Cr}(u_{Cr} \mid B_x) \ge \lambda/L$ and $p_{Cb}(u_{Cb} \mid B_x) \ge \lambda/L$ } or
{ $p_Y(u_y \mid B_x) \ge \lambda/L$ and $p_{Cb}(u_{Cb} \mid B_x) \ge \lambda/L$ }
pixel u is classified as background,
otherwise,
pixel u is classified as foreground.
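The two fusion rules can be compared side by side in a short
sketch. It follows the background test of Eq. (2), in which a
component votes "background" when its likelihood is at least
λ/L; the weights, the threshold T, λ, L = 256 and the sample
likelihoods are hypothetical values chosen for illustration.

```python
def linear_rule(p_y, p_cr, p_cb, weights=(0.5, 0.25, 0.25), T=0.01):
    """Background if the weighted sum of component likelihoods exceeds T."""
    w_y, w_cr, w_cb = weights
    assert abs(w_y + w_cr + w_cb - 1.0) < 1e-9
    score = w_y * p_y + w_cr * p_cr + w_cb * p_cb
    return "background" if score > T else "foreground"

def voting_rule(p_y, p_cr, p_cb, lam=1.0, levels=256):
    """Background if at least two components pass the Eq. (2) test."""
    votes = sum(p >= lam / levels for p in (p_y, p_cr, p_cb))
    return "background" if votes >= 2 else "foreground"

# Luminance matches the background well, both chroma channels do not.
sample = (0.05, 0.001, 0.002)
linear_label = linear_rule(*sample)
voting_label = voting_rule(*sample)
```

In this example the strong luminance match alone pushes the
weighted sum above T, so the linear rule reports background,
whereas the voting rule reports foreground because only one
component agrees.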

Foreground detection rule:
If $\mathrm{count}\big(R(F_c^{LBP} \mid O_c)\big) > N_{th}$,
foreground = True, update,
else
foreground = False, clear,
Endif




B. Foreground Verification using Texture Modeling
Many environmental dynamic textures, such as leaves, fire,
smoke, and sea waves, may reduce the accuracy of target
detection. Here, the dynamic texture is modeled using the
modified local binary pattern (LBP) [11], so that targets
can be detected without the influence of dynamic textures in
the crowd scene. A local texture pattern T [11], centered on
the pixel $g_c$ and having P neighboring pixels, is defined
as:
$T = t\big(s(g_0 - g_c),\, s(g_1 - g_c),\, \ldots,\, s(g_{P-1} - g_c)\big)$, (2)
where
$s(x) = \begin{cases} 1, & |x| \ge \mathrm{threshold} \\ 0, & |x| < \mathrm{threshold} \end{cases}$. (3)
Figure 7 shows the effect of the threshold value on the
texture variation within a single image in the pixel-wise
LBP texture model: the greater the threshold, the harder it
is to detect texture differences and the more likely
neighboring pixels are to be mistaken for the same texture.
Figure 7. Threshold example.
Then, we transform the modified LBP in (2) into an integer
value with the formula in Eq. (4):
$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p$. (4)
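The thresholded pattern and its integer code can be computed as
in the sketch below. It assumes that s(x) returns 1 when the
absolute gray-level difference reaches the threshold; the pixel
values and the threshold of 20 are hypothetical.

```python
def s(x, threshold):
    """Eq. (3): 1 when the gray-level difference is at least the threshold."""
    return 1 if abs(x) >= threshold else 0

def modified_lbp(center, neighbors, threshold):
    """Eq. (4): pack s(g_p - g_c) for p = 0..P-1 into an integer code."""
    return sum(s(g_p - center, threshold) << p
               for p, g_p in enumerate(neighbors))

neighbors = [100, 104, 130, 90, 99, 60, 101, 125]   # g_0..g_7, hypothetical
code = modified_lbp(100, neighbors, threshold=20)
```

Only neighbors 2, 5 and 7 differ from the center by 20 or more,
so the code is 4 + 32 + 128 = 164. Lowering the threshold to 5
also sets bit 3 and gives 172, which illustrates how a smaller
threshold makes the pattern more sensitive.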
1) Foreground Detection using the Modified LBP
Here, we apply the modified LBP to model the dynamic
texture background and remove false foreground detections.
In the LBP-based foreground detection, two threshold values
are required to estimate the bit difference $\Delta$ between
the captured scene and the LBP-based background model. The
LBP-based foreground detection rule is defined as:
$P^{frame(t+1)} = \begin{cases} \text{foreground}, & \text{if } \Delta \ge \Delta_{th} \\ \text{background}, & \text{if } \Delta < \Delta_{th} \end{cases}$. (5)
The bit difference $\Delta$ is calculated as:
$\Delta = \sum_{p=0}^{8} LBP_p^{frame(t)} \;\mathrm{XOR}\; LBP_p^{frame(t+1)}$ (6)
where p is the index of the pixel on the circular chain.
Figure 8 shows how the bit-compare value $\Delta_{th}$ governs
the sensitivity to texture differences between neighboring
images. A larger $\Delta_{th}$ makes texture differences
between neighboring images harder to detect and makes it
easier for the same position in neighboring images to be
judged as similar texture.
Figure 8. $\Delta_{th}$ example.
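The bit difference of Eq. (6) is the number of positions at
which two LBP codes differ, which can be computed with one XOR
and a bit count. The sketch below assumes that a pixel is
labeled foreground when the difference reaches the bit-compare
threshold; the codes and the threshold of 3 are hypothetical.

```python
def bit_difference(lbp_a, lbp_b):
    """Eq. (6): number of bit positions where the two LBP codes differ."""
    return bin(lbp_a ^ lbp_b).count("1")

def classify_pixel(lbp_a, lbp_b, bit_th):
    """Eq. (5): foreground when the bit difference reaches the threshold."""
    if bit_difference(lbp_a, lbp_b) >= bit_th:
        return "foreground"
    return "background"

similar = bit_difference(164, 172)     # 0b10100100 vs 0b10101100
opposite = bit_difference(164, 91)     # 0b10100100 vs 0b01011011
```

Here `similar` is 1 and `opposite` is 8, so with a threshold of
3 the first pair is treated as the same texture and the second
as a texture change.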
2) Foreground Variation
In this study, both the pixel-wise temporal probability
model and the LBP texture model are constructed to detect
the foreground; how to integrate the two background models
to reduce false detections is therefore an important issue.
Based on careful observation of the foreground detections,
the foreground detection rule is designed in terms of two
regions: $R(O_c)$ denotes the region of an object detected
using the pixel-wise temporal probability model on the
current frame c, and $R(F_c^{LBP} \mid O_c)$ denotes the
region of the foreground detected by the pixel-wise LBP
texture model on the current frame c around the region of
$O_c$. To correct false detections, we propose the
update/clear method as follows:
$R'(O_c) = \begin{cases} R(O_c) \cap R(F_c^{LBP} \mid O_c), & \text{if } \mathrm{Update}_{foreground} \\ \mathrm{Null}, & \text{if } \mathrm{Clear}_{foreground} \end{cases}$, (7)
Figure 9 shows the effect of the Nth threshold value on the
total number of pixels whose texture differs between
neighboring images. A larger Nth threshold requires more
texture pixels with substantial differences before a region
is determined to be a foreground object.
Figure 9. Nth Threshold example.
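The update/clear step can be sketched with plain pixel sets.
This is an illustrative reading, not the authors' code: it
assumes that an object is kept when the number of its pixels
confirmed by the LBP model reaches the Nth threshold, and that
"update" keeps the confirmed pixels. The function name, the
regions and the thresholds are hypothetical.

```python
def verify_object(region, lbp_foreground, n_th):
    """Keep an object only if enough of its pixels are LBP foreground.

    region          set of (row, col) pixels from the temporal model
    lbp_foreground  set of (row, col) pixels flagged by the LBP model
    n_th            minimum number of confirming pixels (Nth)
    """
    confirmed = region & lbp_foreground
    if len(confirmed) >= n_th:
        return confirmed      # update: keep the verified object region
    return set()              # clear: discard the false detection

region = {(r, c) for r in range(4) for c in range(4)}   # 4x4 object
lbp_fg = {(0, 0), (0, 1), (1, 1), (9, 9)}               # LBP detections
kept = verify_object(region, lbp_fg, n_th=3)
dropped = verify_object(region, lbp_fg, n_th=5)
```

With three confirming pixels the object survives a threshold of
3 but is cleared at a threshold of 5, matching the behavior
described for larger Nth values.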

III. EXPERIMENTAL RESULTS
This implementation runs at D1 resolution, detects up to 10
objects with trajectories, and achieves a detection rate of
10 fps with one trip zone alert function enabled, while
S-Box simultaneously outputs 20 fps H.264 video with OSD
messages to the backend server. S-Box provides a set of IVA
parameters (Figure 10) that can be adjusted to suit the
environmental conditions. The following shows the steps and
results of S-Box real-time surveillance under sunny daytime
conditions.
• Set the "Y Threshold" value to 500 to reduce the
brightness sensitivity, so that brightness variation caused
by the cluttered background is not misjudged as foreground.
• Set the CR and CB Threshold values to 300 to reduce the
sensitivity to red and blue, so that reddish and bluish
background is not misjudged as foreground.
• Set the "LBP Threshold" value to 20 to reduce the texture
difference sensitivity, so that texture variation within a
single image pixel is regarded as the same texture.
• Set the "LBP Label Threshold" value to 40 to reduce the
sensitivity to the total number of texture pixels between
neighboring images, so that only a sufficient number of
texture pixels with substantial differences is determined
to be a foreground object.
• Set up the depth of the scene by dividing the image into
three regions, with the "Upper Bound" value set to 40 and
the "Middle Bound" value set to 80.
• Set the "Top Label Size" value to 20 for small objects in
the top (far-field) region of the scene, the "Middle Label
Size" value to 40 for the middle region, and the "Bottom
Label Size" value to 120 for large objects in the lower
(close-field) region.
Figure 10. IVA Threshold Parameters.
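The parameter values listed above can be gathered in one
configuration object, as in the sketch below. The field names
are illustrative, and the interpretation of "Upper Bound" and
"Middle Bound" as image row coordinates that select the label
size is an assumption; the paper does not state their units.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IVAParams:
    """IVA settings used in the sunny daytime test (values from the text)."""
    y_threshold: int = 500
    cr_threshold: int = 300
    cb_threshold: int = 300
    lbp_threshold: int = 20
    lbp_label_threshold: int = 40
    upper_bound: int = 40
    middle_bound: int = 80
    top_label_size: int = 20
    middle_label_size: int = 40
    bottom_label_size: int = 120

    def label_size_for_row(self, row):
        """Pick the label size for an image row (assumed: bounds are rows)."""
        if row < self.upper_bound:
            return self.top_label_size       # far field, small objects
        if row < self.middle_bound:
            return self.middle_label_size    # middle region
        return self.bottom_label_size        # close field, large objects

params = IVAParams()
sizes = [params.label_size_for_row(r) for r in (10, 60, 200)]
```

Under this reading, an object near the top of the image is held
to the smallest label size and one near the bottom to the
largest, so `sizes` is `[20, 40, 120]`.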
Fig. 11-(a) shows an outdoor scene. Fig. 11-(b) shows the
foreground obtained with the pixel-wise temporal
probability model, and Fig. 11-(c) shows the dynamic texture
detection model. Using the dynamic texture detection model,
the targets are separated into true foreground targets and
constant-texture objects. If an object contains too many
constant textures, it is treated as noise and removed, as in
Fig. 11-(d). This improves the accuracy of the detected
targets.
Figure 11. Moving target and constant texture target detection on an
outdoor scene. (a) Outdoor scene. (b) The objects are detected by using
pixel-wise temporal probability model. (c) The dynamic texture detection
model. (d) The extracted objects after the texture noise removing process.
In the above outdoor scene test, S-Box performs differently
at different times of day during summer. The detection rate
from 12:00 to 15:00 is 60% because of the strong sunlight,
but it rises to 95% from 15:00 to 18:00 as the sunlight
weakens. Both results were obtained with the same
parameters. Therefore, the next stage of the S-Box
implementation will add an automatic environmental
variation detection function to improve surveillance
performance and make S-Box robust to weather and
environmental effects.
IV. CONCLUSION
In this paper, we implement IVA and WEB streaming on the
TI DM6467 platform as a smart box (S-Box). Through the WEB
streaming technique, users can set up a trip wire or trip
zone from a remote site. Based on these conditions, S-Box
performs object detection and tracking. After video
analysis, S-Box outputs metadata with the H.264 or MJPEG
video stream to the backend server. In addition, S-Box
notifies users by activating the DI or DO interface to
trigger an alarm sound or another device.
The S-Box surveillance system acts as a distributed
solution and helps to solve conventional surveillance
problems such as cable vandalism or loss of the server
connection. S-Box provides stand-alone front-end operation,
and the experimental results show that the proposed
embedded system can perform five functions (trip zone, trip
wire, object detection, object labeling and object
trajectory) at a rate above 10 fps.
ACKNOWLEDGMENT
We would like to thank Professor Cheng-Chang Lien for
his generous support and successful cooperation.

REFERENCES
[1] R. Aguilar-Ponce, A. Kumar, J. L. Tecpanecatl-Xihuitl and M.
Bayoumi, “A network of sensor-based framework for automated
visual surveillance”. Journal of Network and Computer Applications,
2007, pp. 1244–1271.
[2] LSP 2.00 DaVinci Linux VPIF Capture Video Driver (SPRUG99.pdf)
[3] LSP 2.00 DaVinci Linux Video VDCE Driver (SPRUGA3.pdf)
[4] Configuring Codec Engine in Arm apps with creatFromServer (wiki)
[5] Changing the DVEVM memory map (wiki)
[6] Live555, http://www.live555.com/liveMedia/doxygen/html/classes.html
[7] A. Elgammal, D. Harwood and L. Davis, “Non-parametric model for
background subtraction,” in Proceedings of the 6th European
Conference on Computer Vision, 2000, pp. 751-767.
[8] R. Jain, W. Martin and J. Aggarwal, “Segmentation through the
detection of changes due to motion,” Computer Graphics and Image
Processing, Vol. 11, 1979, pp. 13–34.
[9] Y. Ren, C. S. Chua and Y. K. Ho, “Motion detection with
nonstationary background,” Machine Vision and Applications, Vol. 13,
No. 5-6, Mar. 2003, pp. 332–343.
[10] C. C. Lien and S. C. Hsu, “The target tracking using the spatial-
temporal probability model,” IEEE International Conference on
Nonlinear Signal and Image Processing, NSIP 2005, May 2005, pp.
34-39.
[11] M. Xu, J. Orwell, L. Lowey and D. Thirde, “Architecture and
algorithms for tracking football players with multiple cameras” Image
and Signal Processing, IEE Proceedings, Vol. 152, Issue 2, April
2005, pp. 232-241.
[12] K. Nummiaro, E. K. Meier and L. J. V. Gool, “An adaptive color-
based particle filter” Image Vision Computing, Vol. 21, Issue. 1, 2002,
pp. 99-110.
[13] W. Hu, M. Hu, X. Zhou, Tieniu Tan, “Principal axis-based
correspondence between multiple cameras for people tracking” IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 28,
No. 4, April 2006, pp. 663-671.
[14] E. Alpaydin, Introduction to Machine Learning. MIT Press,
Cambridge 2004.
[15] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution Gray
Scale and Rotation Invariant Texture Analysis with Local Binary
Patterns,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 24, no. 7, pp. 971-987, July 2002.
