PhD Thesis Proposal

Table of Contents

Abstract

Motivation

History

Principles

Techniques

Related Work

Limitations of Existing Techniques

Proposed Solution

Application

References

Abstract

Mark Owens [9]

With processors now running at gigahertz speeds, information, whether stored or in transmission, has become more and more vulnerable to hostile eavesdropping, theft, wiretapping and similar attacks. This urges us to devise new data hiding techniques to protect and secure data of vital significance. Steganography is a method of securing data by concealing it inside another medium (called the cover) in which it is saved or transmitted. This doctoral thesis proposal presents a new steganographic technique for hiding data in (ASCII) text files, together with its software implementation. Text is considered the hardest cover medium in steganography to address.

Motivation
While surfing the Net, I encountered an online article in USA Today titled “Terror groups hide behind Web encryption”. It claimed (although no evidence has yet been made public) that terrorists may be using steganography to communicate with each other when planning attacks. This piqued my interest in developing a new concealment technique. It is conjectured that images with hidden messages are posted on bulletin boards, which act as dead drops for other terrorists to pick up and decode.

History
Steganography dates back to ancient Greece, where common practices included etching messages or images into wooden tablets and covering them with wax, and tattooing a message on a messenger's shaved head, letting the hair grow back, and then shaving the head again to read the message.

Early in WWII steganographic technology consisted almost exclusively of invisible inks.

Sources for invisible inks include milk, vinegar, fruit juices and urine that darken when heated.

The following message was sent by a German spy during WWII:

Apparently neutral's protest is thoroughly discounted
and ignored. Isman hard hit. Blockade issue affects
pretext for embargo on by-products, ejecting suets
and vegetable oils.

Taking the second letter in each word the following message emerges:

Pershing sails from NY June 1.

When improved technology made invisible inks easy to detect, null ciphers were used. A null cipher hides an unencrypted message inside an innocent-sounding one. An example of such a message is:

Fishing freshwater bends and saltwater coasts rewards
anyone feeling stressed. Resourceful anglers usually
find masterful leapers fun and admit swordfish rank
overwhelming anyday.

Taking the third letter in each word the following message emerges:

Send Lawyers, Guns, and Money.
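
Both examples follow the same rule: read the n-th letter of every word. The following is a minimal Python sketch of that rule, not a historical decoder. It assumes words are separated by white space and that "by-products" is written as a single hyphenated word; the function name is illustrative.

```python
def null_cipher_decode(text: str, position: int) -> str:
    """Return the letter at the given 1-based position of every word."""
    letters = []
    for word in text.split():
        if len(word) >= position:  # skip words too short to carry a letter
            letters.append(word[position - 1].lower())
    return "".join(letters)


spy_message = (
    "Apparently neutral's protest is thoroughly discounted and ignored. "
    "Isman hard hit. Blockade issue affects pretext for embargo on "
    "by-products, ejecting suets and vegetable oils."
)
angler_message = (
    "Fishing freshwater bends and saltwater coasts rewards anyone feeling "
    "stressed. Resourceful anglers usually find masterful leapers fun and "
    "admit swordfish rank overwhelming anyday."
)

print(null_cipher_decode(spy_message, 2))     # pershingsailsfromnyjunei
print(null_cipher_decode(angler_message, 3))  # sendlawyersgunsandmoney
```

The final "i" of the first result is read as the digit 1; word boundaries and punctuation have to be restored by the reader.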

The Germans developed the microdot technology during WWII. Microdots are text or

photographic images that are shrunk down to the size and shape of a period or the dot of an i or

j. Microdots were usually sent by writing a letter containing periods, i's, or j's, and the intended

recipient could read the messages using a microscope. Because of the extremely small size of

the microdots the messages typically went unnoticed by inspectors.

A steganographic message generally appears to be something else, like an article or a

picture, or some other "cover" message. Drawings have often been used to conceal information

since it is easy to encode a message by varying lines, colors or other elements in pictures. This

tutorial will focus on image files to hide text messages.

Principles:
Steganography can be split into two types, fragile and robust. The following

section defines these two types of steganography.

Fragile

Fragile steganography involves embedding information into a file which is destroyed if the

file is modified. This method is unsuitable for recording the copyright holder of the file

since it can be so easily removed, but is useful in situations where it is important to

prove that the file has not been tampered with, such as using a file as evidence in a

court of law, since any tampering would have removed the watermark. Fragile

steganography techniques tend to be easier to implement than robust methods.

Robust

Robust marking aims to embed information into a file which cannot easily be destroyed.

Although no mark is truly indestructible, a system can be considered robust if the

amount of changes required to remove the mark would render the file useless. Therefore

the mark should be hidden in a part of the file where its removal would be easily

perceived.

There are two main types of robust marking. Fingerprinting involves hiding a unique

identifier for the customer who originally acquired the file and therefore is allowed to use it.

Should the file be found in the possession of somebody else, the copyright owner can use the

fingerprint to identify which customer violated the license agreement by distributing a copy of the

file.

Unlike fingerprints, Watermarks identify the copyright owner of the file, not the

customer. Whereas fingerprints are used to identify people who violate the license agreement,

watermarks help with prosecuting those who have an illegal copy. Ideally fingerprinting should

be used, but for mass production of CDs, DVDs, etc. it is not feasible to give each disk a

separate fingerprint.

Watermarks are typically hidden to prevent their detection and removal; such marks are said to

be imperceptible watermarks. However this need not always be the case. Visible watermarks

can be used and often take the form of a visual pattern overlaid on an image. The use of visible

watermarks is similar to the use of watermarks in non-digital formats (such as the watermark on

British money).

Techniques:
Information hiding techniques are receiving much attention today. The main motivations are the fear that encryption services may be outlawed, and the wish of copyright owners to protect confidential material and intellectual property in digital media such as music, film, books and software against unauthorized access and use, through the use of digital watermarks.

A Steganographic System:

fE: steganographic function "embedding"

fE^-1: steganographic function "extracting"

cover: cover data in which emb will be hidden

emb: message to be hidden

key: parameter of fE

stego: cover data with the hidden message

A Graphical Version of the Steganographic System:

Steganographic messages may first be encrypted and then a cover message is modified

to contain the encrypted message, resulting in stego text. Only those who know the technique

used can recover the message and, if required, decrypt it.
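
The roles of fE, fE^-1, cover, emb, key and stego can be made concrete with a small sketch. It assumes, purely for illustration, that the cover is a byte string, that message bits replace the least significant bit of selected cover bytes, and that the key seeds a pseudo-random choice of positions. None of these choices is prescribed by the system described above, and the function names are hypothetical.

```python
import random


def _positions(cover_len: int, n_bits: int, key: int) -> list:
    """Key-dependent choice of which cover bytes carry a message bit."""
    return random.Random(key).sample(range(cover_len), n_bits)


def embed(cover: bytes, emb: bytes, key: int) -> bytes:
    """fE: hide emb in the least significant bits of cover, returning stego."""
    bits = [(byte >> (7 - i)) & 1 for byte in emb for i in range(8)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for the message")
    stego = bytearray(cover)
    for pos, bit in zip(_positions(len(cover), len(bits), key), bits):
        stego[pos] = (stego[pos] & 0xFE) | bit  # overwrite the lowest bit only
    return bytes(stego)


def extract(stego: bytes, n_bytes: int, key: int) -> bytes:
    """fE^-1: recover n_bytes of hidden message from stego using the same key."""
    bits = [stego[pos] & 1 for pos in _positions(len(stego), n_bytes * 8, key)]
    message = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        message.append(value)
    return bytes(message)


cover = bytes(range(256)) * 2
stego = embed(cover, b"Hi", key=1234)
print(extract(stego, 2, key=1234))
```

The stego data has the same length as the cover and each byte differs from the original by at most one, which is why the receiver needs the key (and the message length) rather than the original cover.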

The message may be a few thousand bits (often at 7 or 8 bits per text character)

embedded in millions of other bits. Probably the most typical cover is a digital image. Digital

images are commonly stored in either 24-bit or 8-bit files. If an 8-bit image is viewed as a grid

and the grid is made up of cells, these cells are called pixels. Each pixel consists of an 8-bit

binary number (or a single byte), and each 8-bit binary number refers to the color palette (a set

of colors defined within the image). All color variations for the pixels are derived from three

primary colors: red, green, and blue. Each primary color is represented by 1 byte (= 8 bits).

Digital watermarking technology is viewed as "an enabling agent allowing more

widespread sharing and use of that content while decreasing worry over piracy". Today

steganography is often used for digital watermarking to hide copyright or ownership information

in an image, movie, or audio file. A copyright holder can pull the hidden copyright or ownership

information out of a suspect file to prove it is stolen. Digital watermarking is not used for

authenticating documents. (Digital signatures perform this task.) A digital watermark refers to

the ability to unobtrusively include information in a file, and is commonly achieved through a

variety of information-hiding techniques, collectively known as steganography.

Algorithms and transformations: Another steganography technique is to hide data in the mathematical transforms used by compression algorithms. The idea is to hide the data bits in the least significant coefficients.

Other techniques of steganography include spread spectrum steganography, statistical

steganography, distortion, and cover generation steganography.

Related Work (Text Techniques)
A photocopy of a book is easy to recognize as a copy, since its quality differs widely from the original; with electronic versions of text this is much more difficult. Copies are identical, and it is impossible to tell whether a file is an original or a copied version. To embed information inside a document we can simply alter some of its characteristics, either the text formatting or characteristics of the characters. It might seem that altering these characteristics would be visible and obvious to third parties or attackers. The key is to alter the document in a way that is not visible to the human eye yet can still be decoded by computer.

The figure above shows the general principle of embedding hidden information inside a document. Again, there is an encoder to embed the information and a decoder to recover it. The codebook is a set of rules that tells the encoder which parts of the document it needs to change. It is also worth pointing out that the marked documents can be either identical or different. By different, we mean that the same watermark is embedded in each document but different characteristics of each document are changed.

Line Shift Coding Protocol

In line shift coding, we shift various lines inside the document up or down by a small fraction (such as 1/300th of an inch) according to the codebook. The shifted lines are undetectable by humans because the shift is so small, but they are detectable when a computer measures the distances between the lines. Differential encoding techniques are normally used in this protocol, meaning that if a line is shifted, the adjacent lines are not moved. These unmoved lines act as controls against which the computer can measure the distances.

By finding out whether a line has been shifted up or down we can represent a single bit, 0 or 1. Over the whole document we can embed a number of bits and therefore hide a substantial amount of information.
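
Differential line shift coding can be sketched on a list of baseline positions. The sketch below is an illustration under stated assumptions rather than a page-layout implementation: positions are in inches measured down the page, every second line carries one bit, its two neighbours stay fixed as controls, and a downward shift is read as 1. The names and the 6-lines-per-inch spacing are assumptions.

```python
SHIFT = 1 / 300  # inches, the order of magnitude mentioned in the text


def line_shift_encode(baselines, bits, shift=SHIFT):
    """Shift every second line down (bit 1) or up (bit 0); neighbours stay put."""
    if len(baselines) < 2 * len(bits) + 1:
        raise ValueError("not enough lines: each bit needs a control line on both sides")
    marked = list(baselines)
    for i, bit in enumerate(bits):
        line = 2 * i + 1  # odd-numbered lines carry data
        marked[line] += shift if bit else -shift
    return marked


def line_shift_decode(marked, n_bits):
    """Compare the gaps above and below each data line to recover its bit."""
    bits = []
    for i in range(n_bits):
        line = 2 * i + 1
        gap_above = marked[line] - marked[line - 1]
        gap_below = marked[line + 1] - marked[line]
        bits.append(1 if gap_above > gap_below else 0)
    return bits


baselines = [i / 6 for i in range(9)]  # nine lines at 6 lines per inch
marked = line_shift_encode(baselines, [1, 0, 1, 1])
print(line_shift_decode(marked, 4))    # [1, 0, 1, 1]
```

Because the decoder only compares neighbouring gaps, it does not need the original document, which is the practical benefit of differential encoding.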

Word Shift Coding Protocol

The word shift coding protocol is based on the same principle as the line shift coding protocol. The main difference is that instead of shifting lines up or down, we shift words left or right. This resembles the justification of the document. The codebook tells the encoder which of the words are to be shifted and whether each shift is to the left or to the right. Again, decoding consists of measuring the spaces between words; a left shift could represent a 0 bit and a right shift a 1 bit.

The quick brown fox jumps the lazy dog.
The quick brown fox jumps the lazy dog.

Word shift coding example

In this example the first line uses normal spacing while the second has had each word

shifted left or right by 0.5 points in order to encode the sequence 01000001 that is 65, the ASCII

character code for A. Without having the original for comparison it is likely that this may not be

noticed and the shifting could be even smaller to make it less noticeable.
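
The arithmetic of this example can be checked with a short sketch. It assumes a hypothetical fixed-width layout (6 points per character, 3 points per space), one bit per word, and a decoder that has the unmarked layout available for comparison; only the 0.5 point shift and the bit sequence come from the example above.

```python
def layout(words, char_width=6.0, space=3.0):
    """Return the x position (in points) of each word in a fixed-width layout."""
    positions, x = [], 0.0
    for word in words:
        positions.append(x)
        x += len(word) * char_width + space
    return positions


def word_shift_encode(positions, bits, shift=0.5):
    """Move word i right for bit 1 and left for bit 0; extra words are untouched."""
    if len(bits) > len(positions):
        raise ValueError("more bits than words")
    marked = list(positions)
    for i, bit in enumerate(bits):
        marked[i] += shift if bit else -shift
    return marked


def word_shift_decode(original, marked, n_bits):
    """Recover bits by comparing each marked position with the original one."""
    return [1 if m > o else 0 for o, m in zip(original[:n_bits], marked[:n_bits])]


words = "The quick brown fox jumps the lazy dog.".split()
bits = [int(b) for b in format(ord("A"), "08b")]  # [0, 1, 0, 0, 0, 0, 0, 1]
original = layout(words)
marked = word_shift_encode(original, bits)
decoded = word_shift_decode(original, marked, 8)
print(chr(int("".join(map(str, decoded)), 2)))    # A
```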

Feature Coding Protocol

Feature coding differs slightly from the protocols above: the document is passed through a parser, which examines the document and automatically builds a codebook specific to it. The parser picks out all the features that it judges usable for hiding information, and each of these is marked in the document. A number of different characteristics can be used, such as the height of certain characters, the dots above i and j, and the length of the horizontal line of letters such as f and t. Line shifting and word shifting techniques can also be used to increase the amount of data that can be hidden.

White Space Manipulation

One way of hiding data in text is to use white space. If done correctly, white space can be manipulated so that bits can be stored. This is done by adding a certain amount of white space to the end of lines, where the amount of white space corresponds to a certain bit value. Because practically all text editors do not show extra white space at the end of lines, it will not be noticed by the casual viewer. In a large piece of text, this can provide enough room to hide a few lines of text or some secret codes. A freely available program that uses this technique is “SNOW”.
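
A minimal sketch of trailing white space encoding follows. The mapping used here (one trailing space for 0, two for 1, one bit per line) is an assumption chosen for clarity; it is not the scheme used by SNOW, and the function names are illustrative.

```python
def whitespace_embed(cover: str, bits) -> str:
    """Append one space (bit 0) or two spaces (bit 1) to successive lines."""
    lines = [line.rstrip(" ") for line in cover.split("\n")]
    if len(bits) > len(lines):
        raise ValueError("cover text has too few lines")
    for i, bit in enumerate(bits):
        lines[i] += " " * (bit + 1)
    return "\n".join(lines)


def whitespace_extract(stego: str):
    """Read bits from trailing spaces until a line without any is reached."""
    bits = []
    for line in stego.split("\n"):
        trailing = len(line) - len(line.rstrip(" "))
        if trailing == 0:
            break
        bits.append(trailing - 1)
    return bits


cover = "line one\nline two\nline three\nline four"
stego = whitespace_embed(cover, [1, 0, 1])
print(whitespace_extract(stego))  # [1, 0, 1]
```

The visible text is unchanged, but the file grows by one or two bytes per carrying line, so capacity is limited by the number of lines in the cover.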

Text Content

Another way of hiding information is to conceal it in what seems to be inconspicuous

text. The grammar within the text can be used to store information. It is possible to change

sentences to store information and keep the original meaning. TextHide is a program, which

incorporates this technique to hide secret messages. A simple example is:

Changed to:

Another way of using text itself is to use random words as a means of encoding

information. Different words can be given different values. Of course this would be easy to spot

but there are clever implementations, such as SpamMimic which creates a spam email that

contains a secret message. As spam usually has poor grammar, it is far easier for it to escape

notice. The following extract from a spam email encodes a hidden phrase:

Dear Friend , Especially for you - this red-hot intelligence . We will comply with all removal requests .

This mail is being sent in compliance with Senate bill 2116 , Title 9 ; Section 303 ! THIS IS NOT A GET

RICH SCHEME . Why work for somebody else when you can become rich inside 57 weeks . Have you

ever noticed most everyone has a cellphone & people love convenience. Well, now is your chance to

capitalize on this . WE will help YOU SELL MORE and sell more! You are guaranteed to succeed

because we take all the risk ! But don't believe us . Ms Simpson of Washington tried us and says "My

only problem now is where to park all my cars" . This offer is 100% legal. You will blame yourself

forever if you don't order now ! Sign up a friend and you'll get a discount of 50%. Thank-you for your

serious consideration of our offer . Dear Decision maker;

Thank-you for your interest in our briefing . If you are not interested in our publications and wish to be

removed from our lists, simply do NOT respond and ignore this mail ! This mail is being sent in

compliance with Senate bill 1623 ; Title 6 ; Section 304 ! THIS

IS NOT A GET RICH SCHEME ! Why work for somebody else when you can …

A very basic form of steganography makes use of a cipher. Here a cipher is essentially a rule, or key, that can be used to decode some data and retrieve a secret hidden message. Sir Francis Bacon created one in the 16th century using messages set in two different typefaces, one bolder than the other. By looking at the positions of the bold characters in relation to the rest of the text, a secret message could be decoded. There are many other ciphers which could be used to the same effect.
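
The two-typeface idea can be sketched in a few lines. This sketch makes several assumptions that are not in the text: upper case stands in for the bolder typeface, each secret letter is written as a 5-bit group (a 26-letter variant rather than Bacon's original alphabet), and the receiver knows how many letters to read.

```python
import string

ALPHABET = string.ascii_lowercase


def bacon_encode(secret: str, cover: str) -> str:
    """Mark cover letters 'bold' (upper case) according to 5-bit letter codes."""
    bits = "".join(format(ALPHABET.index(c), "05b") for c in secret.lower() if c in ALPHABET)
    out, used = [], 0
    for ch in cover:
        if ch in string.ascii_letters:
            bold = used < len(bits) and bits[used] == "1"
            out.append(ch.upper() if bold else ch.lower())
            used += 1
        else:
            out.append(ch)  # spaces and punctuation carry no data
    if used < len(bits):
        raise ValueError("cover has too few letters")
    return "".join(out)


def bacon_decode(stego: str, n_letters: int) -> str:
    """Read the typeface of the first 5 * n_letters letters and rebuild the text."""
    marks = ["1" if ch.isupper() else "0" for ch in stego if ch in string.ascii_letters]
    marks = marks[:5 * n_letters]
    groups = ["".join(marks[i:i + 5]) for i in range(0, len(marks), 5)]
    return "".join(ALPHABET[int(group, 2)] for group in groups)


stego_text = bacon_encode("hi", "the quick brown fox jumps over the lazy dog")
print(stego_text)                   # thE QUiCk brown fox jumps over the lazy dog
print(bacon_decode(stego_text, 2))  # hi
```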

XML

XML is becoming a widely used standard for data exchange. The format also provides

plenty of opportunities for data hiding. This is important for verifying documents to see if they

have been altered and also for copyright reasons. You can embed a code for example, which

can be traced back to the source. A method for hiding information in XML comes courtesy of the

University of Tokyo.

Many different files can exist when XML is used. There is the XML file itself but there can

be transformation files (.xsl), validation files (.dtd) and style files (.css). All of these files can be

used to hide data but the main XML file is usually the best due to its larger size. This technique

concentrates on just the XML file; more elaborate techniques could use a combination of all four

files to increase robustness.

One way of hiding data in XML is to use the different tags as allowed by the W3C. For

example both of these image tags are valid and could be used to indicate different bit settings.

Stego key:
<img></img> -> 0
<img/> -> 1

In this way a piece of XML like the following could be used to encode a simple bit string.

Stego data:
<img src="foo1.jpg"></img>
<img src="foo2.jpg"/>
<img src="foo5.jpg"></img>

Read with the stego key above, the three tags shown encode the bit string 010.
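
The stego key can be applied mechanically. The sketch below is a rough illustration that uses a regular expression rather than a full XML parser, so it assumes well-formed img tags whose attribute values do not contain ">". The function names are hypothetical.

```python
import re

# An <img ...> tag that ends either as "/>" (bit 1) or as "></img>" (bit 0).
IMG_TAG = re.compile(r"<img\b[^>]*?(/>|></img>)")


def decode_img_bits(xml: str):
    """Map each img tag to a bit according to the stego key."""
    return [1 if match.group(1) == "/>" else 0 for match in IMG_TAG.finditer(xml)]


def encode_img_bits(sources, bits) -> str:
    """Write one img tag per bit, choosing the tag form from the stego key."""
    if len(sources) != len(bits):
        raise ValueError("need exactly one image per bit")
    tags = []
    for src, bit in zip(sources, bits):
        tags.append(f'<img src="{src}"/>' if bit else f'<img src="{src}"></img>')
    return "\n".join(tags)


stego_data = (
    '<img src="foo1.jpg"></img>\n'
    '<img src="foo2.jpg"/>\n'
    '<img src="foo5.jpg"></img>'
)
print(decode_img_bits(stego_data))  # [0, 1, 0]
```

Applied to the three tags listed under "Stego data", the key yields the bits 0, 1, 0. Both tag forms are equivalent to an XML processor, so the document renders identically whichever bits are stored.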

Other ways of storing data include using the order in which attributes or elements

appear. For example, assigning the combination of element A followed by element B the bit

value of 1 while if A is followed by some element C, it would be assigned the value of 0.
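
Order-based encoding can be sketched with Python's standard xml.etree.ElementTree module. The sketch assumes a hypothetical users wrapper whose user records each hold a name and an id element; name first is read as 0 and id first as 1. These element names and the mapping are illustrative choices, not part of the published method.

```python
import xml.etree.ElementTree as ET


def encode_order(records, bits) -> str:
    """Emit one <user> per record; the order of its two children carries a bit."""
    if len(records) != len(bits):
        raise ValueError("need exactly one record per bit")
    root = ET.Element("users")
    for (name, ident), bit in zip(records, bits):
        user = ET.SubElement(root, "user")
        fields = [("name", name), ("id", ident)]
        if bit:
            fields.reverse()  # id before name encodes a 1
        for tag, value in fields:
            ET.SubElement(user, tag).text = value
    return ET.tostring(root, encoding="unicode")


def decode_order(xml: str):
    """Read one bit per <user> from the tag of its first child."""
    root = ET.fromstring(xml)
    return [0 if user[0].tag == "name" else 1 for user in root.iter("user")]


records = [("Alice", "01"), ("Bob", "02"), ("Carol", "03")]
stego_xml = encode_order(records, [0, 1, 1])
print(decode_order(stego_xml))  # [0, 1, 1]
```

This only works where the consuming application does not depend on child order, which is the design constraint the surrounding text alludes to.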

Hiding data using the scheme outlined above would be pretty easy. In the case of using

white space, a simple text manipulation program could be used to add the spaces and then a

reader could be created to parse the XML and retrieve the hidden data. The same is true for the

usage of different tags. The structure of elements would be a little more difficult as changing

elements could have an adverse impact on the way the XML is displayed but if cleverly

designed, this could be overcome. In this example the containment of elements is used:

<favorite><fruit>SOMETHING</fruit></favorite> -> 0
<fruit><favorite>SOMETHING</favorite></fruit> -> 1

In this example the order of the elements is used:

<user><name>NAME</name><id>ID</id></user> -> 0
<user><id>ID</id><name>NAME</name></user> -> 1 [2]

Microsoft Office Suite

A great deal of research has been accomplished in the area of hiding data in text, image,

or audio files. There does not seem to be a lot of research in the area of hiding data inside

unused space. The only related work found is by Eric Cole in his book “Hiding in Plain

Sight”, where he gives several examples of how to hide data in various file structures, including

the properties section of Word documents.

In the world of spy vs. spy, covert communication, or steganography, is not a new

concept. This ancient art has been used in many ways and in many mediums and has not been

ignored in this century with the bits and bytes of the computerized world. Many methods have

been found for hiding covert messages and data in computer files. One only has to search the

Internet for steganography, or stego for short, to find multiple freeware utilities that will allow

even a novice computer user to create files with hidden communications. However, where there

is a desire to hide communication, there is also a desire to detect that communication. For this

reason, there are also tools available online to detect covert data in image files. How dangerous

is a hiding place that everyone knows about? What if someone sending covert data used file

types less commonly used for steganography such as MS Word documents? Would that

communication escape notice? Can these files even carry a covert message?

With the large amount of traffic that traverses networks daily, it is impossible for any single administrator or investigator to examine all data. When examining network traffic, a system administrator is limited to the traffic they consider suspicious or dangerous. A system administrator must know the normal traffic across their network and investigate when something odd occurs. A large number of programs today will hide data in image or audio files, so data could be stored inside one of these and sent across the network with little suspicion. However, what if, instead of pictures, someone sends a Word document, then a PowerPoint presentation, followed by any number of common office documents? This varying of file types would create less suspicion by appearing to be normal traffic. Can these files carry covert information? Yes: they contain meta-data and unused bits that can be replaced without obvious effect.

The programs mentioned above that hide data in images perform steganography. There are numerous, well-published ways to use steganography to hide information in image and audio files. However, a less considered area is the simple hiding of information inside the unused spaces of common office files. These spaces are not well known or well documented. They can be used relatively easily to hide data, and using them decreases suspicion as stated above. Also, using these spaces with bit substitution keeps the original file size, which reduces the chance of automated detection or analysis. For these reasons and more, investigators should be made aware of these spaces.

Unused Space and Meta-data Defined

Some files contain readily available spaces that can be used inside their file structures.

One possible example could be meta-data, data about data. Meta-data is ingrained in file

structures but not visible to the user without special tools. Some files also have unused space.

They contain bits that can be overwritten without any adverse or obvious effect on the file.

These spaces are not visible to the average user because they are ignored when the files are

opened. These spaces can be seen when examined at the byte level, something few users

would do. These spaces create an opportunity to hide covert data. This paper shows the results of examining several common office files to see whether they have these spaces and whether they could be used to hide data. It is not our intent to suggest their use, but rather to document their existence as a vulnerability and possible data leakage point.

The Experiments and General Observations

The first set of tests was run on Microsoft Office documents: Word, Excel, and PowerPoint. Next, HTML and email files were examined. Finally, compressed files were tested. Each file type was put through the same set of tests. The presence or absence of meta-data and unused space was immediately obvious in all file types. It was most prevalent in Microsoft Word, which not only kept meta-data but also contained history information about the document, such as who created it and where it was printed.

Along with these meta-data sections, large runs of the repeated hex value FF or 00 were noticed in some file types. These spaces were ideal for hiding data. For each file type, several files of different sizes were examined to determine whether these spaces were constant. The spaces seem to depend more on the version of the software used to create the file than on the file contents. Overwriting these spaces with our data succeeded, but additional data could not be inserted in this area without noticeable side effects. Inserting data changes the length of the file and the layout of the file structure, so once the file is saved it cannot be opened without error messages, and sometimes cannot be opened at all. Inserting data is therefore easy to do, but it corrupts the file in the process. This held true for all the file types that, unlike web pages, did not consist of plain text. Data inserted at the end of the file did not cause this effect but did change the file size, which could help identify the file as containing hidden data. Each file type was tested to see whether data could be hidden at the end of the file, after the end-of-file marker. All proved susceptible to this technique except HTML and email files. Data in either place proved to be volatile: once anyone opens and saves the document, the hidden data is destroyed. Details concerning each of the file types are discussed next.
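
The difference between overwriting and inserting can be illustrated on raw bytes. The sketch below does not parse any Office file format; it simply looks for runs of 0x00 or 0xFF in an in-memory byte string and overwrites part of one, so the length is preserved. The minimum run length of 16 bytes and the sample data are arbitrary assumptions.

```python
def find_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of repeated 0x00 or 0xFF bytes."""
    runs, i = [], 0
    while i < len(data):
        if data[i] in (0x00, 0xFF):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


def overwrite_run(data: bytes, payload: bytes, min_len: int = 16):
    """Overwrite the start of the first run large enough to hold the payload."""
    for start, length in find_runs(data, min_len):
        if length >= len(payload):
            end = start + len(payload)
            return data[:start] + payload + data[end:], start
    raise ValueError("no run of repeated bytes is large enough")


sample = b"HEAD" + b"\x00" * 32 + b"BODY" + b"\xff" * 8
modified, offset = overwrite_run(sample, b"secret")
print(offset, len(modified) == len(sample))  # 4 True
```

Because the payload replaces existing bytes, offsets elsewhere in the file are unchanged, which is consistent with the observation that overwriting succeeded where insertion corrupted the file.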

Results by File Type

Word documents were the first to be tested. 780 bytes of repeated values were discovered and used to hide data. Excel files were examined next. The findings were similar to those for Word; however, Excel had fewer spaces in which to hide data. The largest continuous block was approximately 420 bytes, found just below the header. Finally, PowerPoint files were examined. The results were the same as for the Excel files, except that they seemed to have more of the smaller hiding places. In Word the plain text was obvious, and in Excel the numbers could be seen. PowerPoint content was not so obvious, which made searching for hiding places harder.

In summary, Microsoft Office files provided many opportunities for hiding data. Inserting data caused the files to become corrupt, which could be avoided by appending the data at the end of the file, but the files also had plenty of unused space that could be written over. Another peculiarity

was the need to avoid the area where Microsoft stores its file property information. This area

had to be avoided to prevent others from easily viewing the hidden data. This was discussed in

which provided source code for a program that could be used to hide data in this spot. Other

than this limitation, the inserted data was not apparent and was stable as long as the file was

not altered or saved.

Web files were tested next. HTML and email files are actually no more than text files that

are interpreted by another program. Text files have no headers and no unused space. There are

ways to hide data in text, but there are no data hiding vulnerabilities in the file structures of a

simple text file that we are aware of. However, web pages contain areas that are ignored during

web page creation. There is no real unused space to hide data in, but these ignored areas

create meta-data hiding opportunities. Web browsers also ignore commands they see as errors,

so data can be hidden by placing it inside the symbols “<>.”

These methods have a drawback. Web browsers normally offer a “view source” option. It is not an often-used tool, but it allows any user to view the hidden text with ease. The data could be encrypted or made to look like meta-data using a grammar-based substitution technique, but its presence could still be easily detected.

Email files proved to be similar to HTML files. They are also plain text files that are interpreted by other programs. Emails contain information about each server that the email traveled through. Data can easily be hidden here by mimicking this server information: simply insert the data following the word “Received:”. Most email programs today do not display this information by default. However, just as with HTML documents, one has only to view the source or open the file in a text editor to see the hidden data. In summary, web files could be used to hide data easily, but the ease of use is balanced by the ease of discovery.

When dealing with electronic transfer where space must be conserved, it would not be

uncommon to see compressed files, such as those produced by WinZip. Therefore compressed files were studied

next. Due to the nature of these files, they are not as vulnerable to data hiding. One function of a

compression algorithm is to look for long strings of redundant bytes and transform them into

smaller strings that represent them. Therefore, the long strings of repeated values being used to

hide data here would have been reduced or eliminated. However, because of the commonality

of these files, tests were run to confirm this.

Data was successfully added after the end-of-file marker, but there were no unused spaces inside the compressed files to use for hiding data. It was also noted that compressing a file with hidden data and then uncompressing it did not affect the hidden data. In addition, while the file was compressed the hidden data was not readable with a hex editor. Files containing hidden data compressed to a larger size than the same files without it, because substituting hidden data removes redundant bits that compression would otherwise eliminate. This could be a red flag for hidden data if the compression ratios of files were checked. [1]
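
The effect on compression ratio is easy to reproduce with Python's standard zlib module. In this sketch the 780-byte run of zero bytes echoes the size reported for Word above, but the file contents are otherwise invented, and a sequence of 256 distinct byte values stands in for hidden (for example encrypted) data.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more redundancy."""
    return len(zlib.compress(data)) / len(data)


# A made-up "file": a short header, a long run of zero bytes, a short trailer.
clean = b"HEADER" + b"\x00" * 780 + b"TRAILER"

# Overwrite part of the run with data that has little redundancy.
payload = bytes(range(256))
stego = clean[:6] + payload + clean[6 + len(payload):]

print(len(clean) == len(stego))                             # True
print(compression_ratio(stego) > compression_ratio(clean))  # True
```

The two files have the same size on disk, yet the one carrying the payload compresses noticeably worse, which is the signal an investigator could check for.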

Limitations of Existing Text-based Steganographic Techniques

The major drawbacks of the techniques cited above are as follows:

Data hidden in .doc files is lost when the file is saved in PDF, ASCII text or a similar format.

An increase or decrease in line or word spacing is eye-catching, and so is the separation of words or lines with extra spaces.

Extra spaces placed at the end of a sentence can go unnoticed, except when one selects a page or an entire document for copying, where the extra spaces become prominent.

Adding spaces past the end-of-file mark can raise suspicion because of the increased file length.

Proposed Solution

To date, no known text-based data hiding technique can hide information without increasing or decreasing the document length and/or altering the appearance of the text.

The proposed thesis aims to develop a coding technique that hides data within the actual contents of the text file used as cover, addressing all of the existing drawbacks of text-based steganographic systems, and duly supported by a complete software solution.

This will eliminate the possibility of losing hidden data when the text is compressed or converted to the PDF file format. In addition, anyone in possession of the original cover will find no change in the contents or layout of the stego-text document on comparison.

Application:

This technique can best be applied to web pages for unnoticed global communication, since attention there is primarily focused on images and text spacing. A real-time demonstration of this will also be given.

References

Jonathan Cummins, Patrick Diskin, Samuel Lau and Robert Parlett, Steganography and Digital Watermarking, School of Computer Science, The University of Birmingham, 2004.
