CS 120 (Spring 21): Introduction to Computer Programming II
Long Project #10
due at 5pm, Thu 25 Mar 2021
1 Overview
In this project, you will be writing a single function; this function will (very
approximately) recreate the DEFLATE algorithm – which will show you how a
.zip file stores compressed data. You will be passed an array of tokens, which
represent the data in the compressed file; you will return a string, which is the
uncompressed data.
2 Python Hints: type()
Python has a handy function, which you can use to check for the type of any
object: type. You can compare the return value to the various standard Python
types, like this:
foo = [1,2,3]
if type(foo) == list:
    print("Yep, it's a list!")
if type(foo[0]) == int:
    print("Yep, element 0 is an integer")
Other common types you might want to check for:
str
tuple
set
dict
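As a small illustration of the same idea, the sketch below walks over a list that holds several kinds of values and labels each one by its type. The names describe and items are made up for this example; they are not part of the assignment.

```python
def describe(value):
    """Return a short label for a few common built-in types."""
    if type(value) == str:
        return "str"
    if type(value) == tuple:
        return "tuple"
    if type(value) == int:
        return "int"
    return "something else"

# A list can hold values of different types side by side.
items = ["a", (3, 4), 17, 2.5]
labels = [describe(item) for item in items]
print(labels)   # ['str', 'tuple', 'int', 'something else']
```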
3 Python Hints: assert
An assert is a statement, somewhere in your code, which states something
which must be true. Sometimes, we do this to sanity-check our own code (so
that we’ll notice if something goes wrong); sometimes, we use it as a way to
report errors, if somebody gives us bad input.
To write an assert, you simply write the keyword assert, followed by some
sort of Boolean expression (that is, something which resolves to True or False).
For instance, if you believe that a string will never be empty, you could write
assert msg != ""
Every time that Python encounters this line of code, it will check the
condition. Hopefully, it’s always true; if it is, then Python keeps running your
program. But if it turns out to be false, your assert will “throw an
exception.” We’ll define exceptions later – but for now, you can know that exceptions
kill the function they are in, and the calling function, and the one that called
that – eventually killing your entire program. But it’s possible to write code
that will “catch” an exception – this means that we will notice the exception,
handle it, and prevent it from killing the program.
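To make this concrete, here is a minimal sketch of an assert that fails and a caller that catches the resulting exception. A failed assert raises Python's built-in AssertionError. The function first_letter is a made-up example, and the try/except syntax is only a preview of material covered later.

```python
def first_letter(msg):
    # Hypothetical helper: it refuses to work on an empty string.
    assert msg != "", "msg must not be empty"
    return msg[0]

results = []
for candidate in ["hello", ""]:
    try:
        results.append(first_letter(candidate))
    except AssertionError:
        # The assert failed; we notice it here instead of crashing.
        results.append("caught a failed assert")

print(results)   # ['h', 'caught a failed assert']
```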
In this assignment, some of the testcases will intentionally send you bad
input, and we expect you to detect this with an assert. If you do this properly,
the testcase will catch the exception, print out a success message, and you pass
the testcase! But if you don’t check for the error, and keep running, then you
will fail the testcase.
On the other hand, if you fail an assertion when we didn’t expect it, then
this will kill the testcase – and that will be a different sort of failure.
4 Zip Format
Did you ever wonder how a .zip file stores its data? How can they store all sorts
of data, and then later decompress it, without ever losing any of the contents?
It turns out that .zip files encode the data in two ways: some portions of
the file are simply stored directly, with no compression at all; other portions of
the file are listed as copies of data that came earlier in the file. This doesn’t
work for all file formats (for instance, compressing an already-compressed file
rarely accomplishes much), but for lots of different types of data (especially
text), it works very well.
In fact, this format works so well that this same basic format – with only
slight modifications – is also used in .gz files (the standard compressed file
format for UNIX), and .png files (an image format that is widely used on the
web).
We’ll only be doing a tiny bit of the decoding process – and we won’t even
attempt to encode a data stream. But if you are curious, you can read about
the details of the format online: https://en.wikipedia.org/wiki/DEFLATE
4.1 How it Works
In the DEFLATE format, data is made up of a sequence of items, each of which
is one of the following:
• A single character
• A reference to previous data in the uncompressed stream. This reference
has both a distance and a length
For example, suppose that you wanted to encode the text below (typed with two
spaces after each period; every space counts as a character):
See Jack run.  Jack runs up the hill.  Jack and Jill roll down.
We won’t try to write an encoder for this assignment – but let’s consider it
by hand.
First, suppose that you noticed that the word “Jack” showed up in the text
a number of times. You would leave the first “Jack” intact, but would replace
the second with a reference to previous text, like this:
See Jack run.  (11,4) runs up the hill.  Jack and Jill roll down.
(In this example, we see that the 2nd “Jack” is 11 characters after the first one,
and we wanted to copy 4 characters.)
If we were just a little smarter, we would notice that we could copy the
spaces before and after the word as well:
See Jack run. (11,6)runs up the hill.  Jack and Jill roll down.
You also want to replace the third “Jack.” If you want, you can make it point at
the first one:
See Jack run. (11,6)runs up the hill. (35,6)and Jill roll down.
or at the second one, which works just as well:
See Jack run. (11,6)runs up the hill. (24,6)and Jill roll down.
Of course, there are lots of other duplicates in the string:
See Jack run. (11,6)runs up the hill(24,8)and Jill roll down.
See Jack run. (11,6)runs up the hill(24,8)and Ji(16,2) roll down.
See Jack run. (11,6)runs up the hill(24,8)and Ji(16,2) ro(5,3)down.
See Jack run. (11,9)s up the hill(24,8)and Ji(16,2) ro(5,3)down.
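If you want to double-check the distances in these examples, a small sketch like the following can help. It assumes the sentence is typed with two spaces after each period, since the distances only work out that way. The function name reference_matches and the hard-coded positions are made up for this illustration and are not part of the assignment.

```python
# Assumption: two spaces follow each period, as the distances imply.
text = "See Jack run.  Jack runs up the hill.  Jack and Jill roll down."

def reference_matches(text, position, distance, length):
    """Check that the `length` characters starting at index `position`
    repeat the characters that begin `distance` places earlier."""
    start = position - distance
    return text[start:start + length] == text[position:position + length]

# (position in the uncompressed text, distance, length) for each example
examples = [
    (15, 11, 4),   # second "Jack"
    (14, 11, 6),   # " Jack " with its surrounding spaces
    (38, 35, 6),   # third " Jack ", pointing at the first
    (38, 24, 6),   # third " Jack ", pointing at the second
    (36, 24, 8),   # ".  Jack "
    (50, 16, 2),   # "ll" in "Jill"
    (55, 5, 3),    # "ll " in "roll "
    (14, 11, 9),   # " Jack run"
]
all_ok = all(reference_matches(text, p, d, n) for p, d, n in examples)
print(all_ok)   # True
```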
4.2 How We’ll Represent the Encoding
We will represent a compressed data stream as a list of items. Each item will
be one of two things: either a single character, or a tuple which contains two
integers. The single character represents a single character in the output; the
tuple represents a backwards reference. So the compressed data above would
be encoded like this:
['S', 'e', 'e', ' ',
 'J', 'a', 'c', 'k',
 ' ', 'r', 'u', 'n',
 '.', ' ', (11,9), 's',
 ' ', 'u', 'p', … and so on …
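Typing a long list like this by hand is tedious. One possible shortcut, sketched here, relies on the fact that list() applied to a string produces a list of single-character strings. The variable names are made up for this illustration, and the sketch only counts tokens; it does not decode anything.

```python
# The final encoding from section 4.1, written out as a list of tokens.
tokens = (
    list("See Jack run. ") + [(11, 9)]
    + list("s up the hill") + [(24, 8)]
    + list("and Ji") + [(16, 2)]
    + list(" ro") + [(5, 3)]
    + list("down.")
)

singles = [t for t in tokens if type(t) == str]
references = [t for t in tokens if type(t) == tuple]

# Each single character yields one output character; each reference
# yields as many characters as its length (the second tuple element).
expected_length = len(singles) + sum(length for _, length in references)
print(len(tokens), len(singles), len(references), expected_length)
# 45 41 4 63
```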
4.3 Binary Encoding
Looking at the examples above, you may not think that this algorithm works
very well; after all, it seems like the backward references take as much space as
the original text – maybe even more! That’s true, in this simplified picture.
But in the real encoding, the information is packed very, very tightly, with
a lot of cool tricks which mean that a backward reference can be encoded in
a very small number of bytes. We don’t have space to discuss that here – but
trust me, it works very well!
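If you would like to see the real thing in action, Python's standard zlib module compresses data using DEFLATE. The sketch below compresses a repetitive piece of text and compares the sizes; the sample text and variable names are just an illustration, and the exact compressed size will depend on your zlib version.

```python
import zlib

# Repetitive input gives the compressor lots of earlier data to refer to.
original = b"See Jack run.  Jack runs up the hill.  Jack and Jill roll down.  " * 50

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print(restored == original)   # True: nothing was lost
```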
4.4 Final Detail: Overlapped Copy Ranges
The following encodes a 10-character string, but it does it in a very strange way.
This is valid in the DEFLATE format – and we’ll allow it in this assignment as
well. What do you think would be the proper decompressed data?
['x', (1,9)]
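One way to think about an overlapped range is to copy a single character at a time, so that characters written early in the copy are available to be read later in the same copy. The sketch below tries this on a different stream, ['a', 'b', (2,6)]. The helper copy_reference is a made-up name, and this is only one possible approach, not a required design.

```python
def copy_reference(output, distance, length):
    """Append `length` characters to `output` (a list of characters),
    each copied from `distance` places before the current end."""
    for _ in range(length):
        # output grows on every step, so index -distance keeps moving
        # forward and can land on characters this loop just added.
        output.append(output[-distance])
    return output

chars = ["a", "b"]
copy_reference(chars, 2, 6)
result = "".join(chars)
print(result)   # abababab
```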
5 Required Function: unzip()
unzip() takes a compressed stream as its only parameter, and returns the
uncompressed string.
The compressed stream is encoded as described earlier in this spec; the
stream is a list, and each element is either a string (with a single character), or
a tuple with two integers.
5.1 asserts
This function must assert that:
• Every element in the compressed stream is the proper type (string or
tuple)
• Every string is exactly one character long
• Every tuple has exactly two elements, both of which are positive integers
• The “offset” (the first element in the tuple, called the distance in
section 4.1) does not point back any further than the start of the
uncompressed string.
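The rules above could be checked one token at a time. The following is a rough sketch of one possible reading of them, not a reference solution: the helper check_token and its produced parameter (how many characters have been decompressed so far) are assumptions made for this example.

```python
def check_token(token, produced):
    """Assert that one token is well formed. `produced` is the number
    of characters already decompressed before this token."""
    assert type(token) in (str, tuple)
    if type(token) == str:
        assert len(token) == 1
    else:
        assert len(token) == 2
        offset, length = token
        assert type(offset) == int and type(length) == int
        assert offset > 0 and length > 0
        # The reference may not reach back past the start of the output.
        assert offset <= produced

# Each of these breaks a different rule.
bad_tokens = ["ab", (1, 2, 3), (0, 4), (5, 1), 7]
rejected = 0
for token in bad_tokens:
    try:
        check_token(token, produced=3)
    except AssertionError:
        rejected += 1
print(rejected)   # 5
```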
6 Turning in Your Solution
You must turn in your code using GradeScope.
You must write a file named unzip.py. It must contain (at least) one
function: unzip().