Chapter 1 Data Representation

1.01 Binary number system

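Origins of binary and denary

Origin of denary

Humans calculate in denary (base 10), probably because we have ten fingers. Mathematics grew out of counting, and the most primitive method is counting on one's fingers. Aristotle observed that the widespread human use of denary is simply the result of the anatomical fact that most people are born with ten fingers.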

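Ancient Roman numerals: base 5

The Maya: base 20

Other bases

Binary: the simplest base, but cumbersome to use

Octal:

Duodecimal (base 12): a year has twelve months

Hexadecimal (base 16): sixteen liang make one jin (traditional Chinese units of weight)

Vigesimal (base 20): used by the Maya, who counted on fingers and toes, 20 in total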

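Origin of binary

Early mechanical calculating devices did not use binary. They used denary or other bases and represented different values by the positions of gears, which is arguably closer to the way people think. For example, suppose a device has ten gears linked in a chain, each gear has ten positions, and one full turn of a gear advances the next gear by one position. This is a simple ten-digit denary data representation device that can represent the numbers 0 to 9,999,999,999. Combined with some other mechanical parts, such a gear-based device can perform simple denary addition and subtraction.

After electronic computers appeared, using vacuum tubes to represent ten states proved too complex, so electronic computers came to use only two basic states: on and off.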

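Why binary is used

Easy to implement in circuits: when a computer is running, its circuits are powered, so each output carries a voltage, and a high or low voltage can stand for 1 or 0.
Strong resistance to interference and high reliability: because each bit has only two states, high and low, it can still be reliably identified as high or low under a certain amount of interference.
Convenient for logical decisions (yes or no) and well suited to logical operations: Boolean algebra is the theoretical basis of logical operations, and the two binary digits correspond exactly to "true" and "false" in Boolean algebra.

All information in a computer is represented in binary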

The word 'Hello' is stored as the binary combination of 0100100001100101011011000110110001101111
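The bit pattern above can be reproduced with a short Python sketch. It assumes the text is encoded as ASCII with 8 bits per character; the function name `text_to_bits` is made up for this illustration.

```python
def text_to_bits(text: str) -> str:
    """Return the 8-bit binary code of every character, joined into one string."""
    # Assumption: ASCII encoding, one byte per character.
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


hello_bits = text_to_bits("Hello")
print(hello_bits)
```

Each group of 8 bits in the result corresponds to one letter, for example `01001000` is 72, the ASCII code of `H`.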

1.02 Number Systems

Key Terms

Data - numbers, symbols or alphanumeric characters in their raw format before processing.
Analogue - this is the smooth stream of data that our senses process on a daily basis, such as a sound wave.
Digital - data represented in the values 1 and 0 that a computer can process.
Denary - a system of numbers with a base of 10. Each place value increases by a power of 10.
Binary – a system of numbers with a base of 2. Each place value increases by a power of 2.

Converting between denary and binary

[Figure: converting between binary and denary]
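One common way to convert by hand is repeated division by 2 (denary to binary) and doubling while reading the bits left to right (binary to denary). The sketch below follows that approach; the function names are illustrative and only non-negative whole numbers are handled.

```python
def denary_to_binary(n: int) -> str:
    """Convert a non-negative denary integer to a binary string."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # the remainder is the next bit, right to left
        n //= 2
    return "".join(reversed(bits))


def binary_to_denary(bits: str) -> int:
    """Convert a binary string to a denary integer."""
    total = 0
    for bit in bits:
        total = total * 2 + int(bit)  # double the running total, then add the bit
    return total


print(denary_to_binary(107))
print(binary_to_denary("1101011"))
```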

Overflow

An overflow error is an indication that a number is too big to be stored in the number of bits available (for example, in an 8-bit register).
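As a rough illustration, the sketch below adds two unsigned values in an assumed 8-bit register and reports whether a ninth bit would have been needed. The register width and the name `add_8bit` are assumptions made for the example.

```python
def add_8bit(a: int, b: int) -> tuple[int, bool]:
    """Add two unsigned 8-bit values; return (stored result, overflow flag)."""
    total = a + b
    overflow = total > 0b11111111  # a 9th bit would be needed
    return total & 0xFF, overflow  # only the low 8 bits are kept


print(add_8bit(110, 74))   # fits in 8 bits
print(add_8bit(200, 100))  # 300 does not fit in 8 bits
```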

Logical binary shifts

Logical shifts involve shifting (moving) bits to the left (multiplying by 2 for each shift) or to the right (dividing by 2 for each shift). If shifting to the left or right results in a loss of 1-bits, then this would result in an error.
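A minimal sketch of both shifts, assuming an 8-bit register and unsigned values (the function names are illustrative). The last call shows a case where a 1-bit is lost, so the result is no longer the original value multiplied by 2.

```python
def logical_shift_left(value: int, places: int, width: int = 8) -> int:
    """Shift left in a fixed-width register; bits pushed past the top are lost."""
    return (value << places) & ((1 << width) - 1)


def logical_shift_right(value: int, places: int) -> int:
    """Shift right; bits pushed past the bottom are lost, zeros fill the top."""
    return value >> places


print(format(logical_shift_left(0b00010101, 2), "08b"))   # 21 * 4 = 84
print(format(logical_shift_right(0b00010100, 2), "08b"))  # 20 / 4 = 5
print(format(logical_shift_left(0b11000000, 1), "08b"))   # a 1-bit is lost
```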

One's complement

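A primary school pupil can work out 5 - 3 but not 3 - 5. To handle this, the concept of negative numbers is introduced later.

3 - 5 = 3 + [-5] = [-2], where a number in square brackets stands for its one's complement encoding.

A computer's digital circuits contain adders but no dedicated subtractors. Since subtraction can be carried out by adding a one's complement, there is no need to design a separate subtractor.

3 = [0_0000011]
5 = [0_0000101]
-5 in sign and magnitude = [1_0000101]
-5 in one's complement = [1_1111010]

3 + [-5] = [-2]: [0_0000011] + [1_1111010] = [1_1111101]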

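Why this works:

In effect, the one's complement of -5 can be read as the unsigned value 255 - 5: -5 = [1_1111010] = 255 - 5

[3] + [-5] = 3 + 255 - 5 = 253 = 255 - 2 = [-2]

[7] + [-5] = 7 + 255 - 5 = 255 + 2 = [2] (the sum exceeds 255, so the carry out of the top bit is added back in, which amounts to working modulo 255)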
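The two sums above can be checked with a small sketch. It assumes 8-bit values and the usual "end-around carry" rule for one's complement addition (a carry out of the top bit is added back to the result); the function names are made up for this example.

```python
def ones_complement(n: int, width: int = 8) -> int:
    """Return the one's complement bit pattern of n as an unsigned integer."""
    mask = (1 << width) - 1
    return n if n >= 0 else ~(-n) & mask  # negative: flip every bit of |n|


def add_ones_complement(a: int, b: int, width: int = 8) -> int:
    """Add two one's complement bit patterns using an end-around carry."""
    mask = (1 << width) - 1
    total = a + b
    if total > mask:                # carry out of the top bit
        total = (total & mask) + 1  # add the carry back in
    return total & mask


minus_five = ones_complement(-5)
print(format(minus_five, "08b"))
print(format(add_ones_complement(3, minus_five), "08b"))  # pattern of -2
print(format(add_ones_complement(7, minus_five), "08b"))  # pattern of +2
```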

The problem with one's complement:

[0_0000000] is +0 and [1_1111111] is -0, so zero has two encodings, which is unnecessary in computation.

Two's complement

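Zero is a special number: it is neither positive nor negative.

In one's complement, the non-negative range is +0 to +127 and the negative range is -127 to -0.

So there are two zeros: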

[0_0000000] +0

[1_1111111] -0

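But in a computer every number should have exactly one encoding. So the negative numbers are shifted along by one as a whole, and the range becomes -128 to -1 and 0 to 127. The one's complement plus 1 is called the two's complement.

Two's complement: a positive number is unchanged, for example 3 = [0_0000011]

For a negative number, first take the one's complement, then add 1: -5 = [1_1111010] + 1 = [1_1111011]

3 + [-5] = [-2]: [0_0000011] + [1_1111011] = [1_1111110]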
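The "invert, then add 1" rule can be sketched as follows, assuming 8-bit values in the range -128 to 127. The helper names are illustrative; ordinary addition followed by discarding any ninth bit is enough to get the right answer.

```python
def twos_complement(n: int, width: int = 8) -> int:
    """Return the two's complement bit pattern of n as an unsigned integer."""
    mask = (1 << width) - 1
    if n >= 0:
        return n & mask              # positive numbers are unchanged
    ones = ~(-n) & mask              # one's complement: flip every bit of |n|
    return (ones + 1) & mask         # then add 1


def from_twos_complement(pattern: int, width: int = 8) -> int:
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    if pattern & (1 << (width - 1)):  # sign bit set means negative
        return pattern - (1 << width)
    return pattern


total = (twos_complement(3) + twos_complement(-5)) & 0xFF  # drop any 9th bit
print(format(twos_complement(-5), "08b"))
print(format(total, "08b"), from_twos_complement(total))
```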

Binary coded decimal (deprecated)

This is useful in applications that require single denary digits to be stored or transmitted.

The BCD code uses a nibble to represent a denary digit.
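A short sketch of the idea, with each denary digit stored in its own 4-bit nibble. The function names and the space-separated output format are choices made for this example only.

```python
def to_bcd(number: int) -> str:
    """Encode each denary digit of a non-negative integer as a 4-bit nibble."""
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bcd: str) -> int:
    """Decode space-separated BCD nibbles back into a denary integer."""
    return int("".join(str(int(nibble, 2)) for nibble in bcd.split()))


print(to_bcd(3165))
print(from_bcd("0011 0001 0110 0101"))
```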

1.03 Text, Sound and Images

Key Items

character – text, numbers and symbols, for example each key on a keyboard.

ASCII Code

In ASCII each character takes 1 byte (8 bits) of storage space.

The 7-bit version of the code (often referred to as US ASCII) was standardised many years ago by ANSI (American National Standards Institute).

Unicode

In 16-bit Unicode encodings (UCS-2, and UTF-16 for characters in the Basic Multilingual Plane) a character takes up 2 bytes, as it is made up of 16 bits.

It should be noted that Unicode codes have been developed in tandem with the Universal Character Set (UCS) scheme, standardised as ISO/IEC 10646.

Note that in UTF-8, for the two-byte, three-byte and four-byte representations, all continuation bytes have the two most significant bits set to 10.
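The continuation-byte pattern can be seen by printing the UTF-8 bytes of a few characters. This is a small sketch using Python's built-in `str.encode`; the helper name `utf8_bits` and the sample characters are chosen for illustration.

```python
def utf8_bits(char: str) -> list[str]:
    """Return the UTF-8 bytes of a character as 8-bit binary strings."""
    return [format(byte, "08b") for byte in char.encode("utf-8")]


for sample in ["A", "é", "€", "😀"]:
    print(sample, utf8_bits(sample))
```

In each multi-byte result, every byte after the first starts with `10`, while the first byte starts with `110`, `1110` or `11110` depending on the length.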

Recommended reading on character sets: the history of, and differences between, ASCII, Unicode and UTF-8

beyond 8-bit binary

Measuring memory size

Byte B
Kilobyte KB
Megabyte MB
Gigabyte GB
Terabyte TB
Petabyte PB

Using binary in computer registers

Register - small piece of memory where values can be held

A register is a small piece of memory built into the central processing unit (CPU) of a computer system where values and instructions are temporarily held.

small in capacity
extremely fast read and write rate
data can be written to and read from a register much quicker than from primary memory or secondary storage

Uses of the hexadecimal system

Hexadecimal – a system of numbers with a base of 16. Each place value increases by a power of 16.

Debug – finding and fixing problems and errors in a program.

Computers cannot actually process hexadecimal; they convert it into binary before processing it.

Programmers work with hexadecimal because it is easier for humans to read than binary. Typical uses include:

Error codes
Media Access Control (MAC) address
Internet Protocol (IP) address
Hypertext mark-up language (HTML) colour codes
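Because each hexadecimal digit corresponds to exactly four bits, converting between the two is a digit-by-digit substitution. The sketch below shows this, plus one of the uses listed above: reading an HTML colour code as three values. The function names and the sample values are assumptions for illustration.

```python
def hex_to_binary(hex_digits: str) -> str:
    """Replace every hexadecimal digit with its 4-bit binary equivalent."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_digits)


def binary_to_hex(bits: str) -> str:
    """Group bits into nibbles (padding on the left) and convert each to hex."""
    bits = bits.replace(" ", "")
    padded = bits.zfill(-(-len(bits) // 4) * 4)  # round length up to a multiple of 4
    return "".join(format(int(padded[i:i + 4], 2), "X")
                   for i in range(0, len(padded), 4))


def html_colour_to_rgb(code: str) -> tuple[int, int, int]:
    """Split an HTML colour code such as #FF8000 into red, green and blue."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_binary("4F5"))
print(binary_to_hex("101111"))
print(html_colour_to_rgb("#FF8000"))
```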

html color

Mac Address

Media Access Control (MAC) addresses are 12-digit hexadecimal numbers that uniquely identify each device on a network.

00-1B-63-84-45-E6
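A MAC address can be read as six bytes, 48 bits in total. The sketch below assumes the address is written as six two-digit hexadecimal groups separated by hyphens or colons; the function names are hypothetical.

```python
def parse_mac(mac: str) -> list[int]:
    """Split a MAC address into its six byte values."""
    parts = mac.replace(":", "-").split("-")
    if len(parts) != 6 or any(len(part) != 2 for part in parts):
        raise ValueError("expected six two-digit hexadecimal groups")
    return [int(part, 16) for part in parts]


def mac_to_bits(mac: str) -> str:
    """Return the 48-bit binary form of a MAC address."""
    return "".join(format(byte, "08b") for byte in parse_mac(mac))


print(parse_mac("00-1B-63-84-45-E6"))
print(mac_to_bits("00-1B-63-84-45-E6"))
```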

IP address

Text, numbers and symbols

ASCII

In ASCII each character takes 1 byte (8 bits) of storage space.
In 16-bit Unicode encodings a character takes up 2 bytes, as it is made up of 16 bits.
UTF-8

Pictures

Each pixel of an image is represented by a binary value. The set of bits at the same position across all pixels is called a "bit-plane". 1 bit per pixel gives 2 colors, 2 bits give 4 colors, and 3 bits give 8 colors.

1-bit picture

2-bits

Color depth – how many bits represent each pixel
Resolution - Width & Height (in pixels)
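The relationship between color depth and the number of available colors is a power of 2. A one-line sketch (the function name is made up for this example):

```python
def colours_for_depth(bits_per_pixel: int) -> int:
    """Number of distinct colours that bits_per_pixel bits can represent."""
    return 2 ** bits_per_pixel


for depth in (1, 2, 3, 8, 24):
    print(depth, "bits per pixel:", colours_for_depth(depth), "colours")
```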

RGB color

256 colors

RGB and Binary

gray image

Sound

The sound we hear is also analogue, but computers work digitally and can only process binary.

Sound is recorded at set timed intervals; this process is known as sampling.
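As a rough sketch of sampling, the code below measures a sine wave at fixed time intervals and stores each measurement as a whole number. The sine wave stands in for a real sound, and the bit depth parameter (the number of bits per sample) is an assumption that the text above does not mention.

```python
import math


def sample_wave(frequency_hz: float, sample_rate_hz: int,
                duration_s: float, bit_depth: int = 8) -> list[int]:
    """Sample a sine wave at regular intervals and quantise each sample."""
    levels = 2 ** bit_depth                      # number of possible sample values
    count = int(sample_rate_hz * duration_s)     # how many samples are taken
    samples = []
    for i in range(count):
        t = i / sample_rate_hz                   # time of this sample in seconds
        amplitude = math.sin(2 * math.pi * frequency_hz * t)   # between -1 and 1
        samples.append(round((amplitude + 1) / 2 * (levels - 1)))
    return samples


print(sample_wave(frequency_hz=1, sample_rate_hz=8, duration_s=1.0, bit_depth=4))
```

A higher sample rate or bit depth follows the original wave more closely, at the cost of more data.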

analogue

Calculation of file size

image file size::

image resolution (number of pixels) × colour depth (in bits)

image of 4160 × 3120 pixels::

R: 8 bits, G: 8 bits, B: 8 bits

RGB: 24 bits = 3 bytes per pixel

image file size::

4160 × 3120 × 3 bytes = 38,937,600 bytes
38,937,600 bytes / 1024 = 38,025 KiB
38,025 KiB / 1024 ≈ 37.13 MiB
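The same calculation as a short sketch, assuming an uncompressed image with no file header or metadata (the function name is illustrative):

```python
def image_size_bytes(width: int, height: int, colour_depth_bits: int) -> int:
    """Uncompressed image size: number of pixels times colour depth, in bytes."""
    total_bits = width * height * colour_depth_bits
    return (total_bits + 7) // 8  # round up to a whole number of bytes


size = image_size_bytes(4160, 3120, 24)
kib = size / 1024
mib = kib / 1024
print(size, "bytes")
print(kib, "KiB")
print(round(mib, 2), "MiB")
```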

1.04 Data storage and file compression

key terms

Algorithm - a step-by-step set of instructions

Data compression is done by using compression algorithms that manipulate the data.

feature::

less storage space
the file is easier to transmit from one device to another

Measurement of data storage

decimal prefix

binary prefix

compression

Lossy compression

In lossy compression, some of the data is removed permanently, so it is effectively ‘lost’. This reduces the size of the file.

mostly used for multimedia such as audio, video and image files

Lossless compression

Lossless refers to a method of compression that loses no data in the process.

the compressed data can be reversed to reconstruct the data file exactly as it was

Run length encoding (RLE)

The original image (8 × 8 square) would need three bytes per square (to include all three RGB values). Therefore, the uncompressed file for this image is::

8 × 8 × 3 = 192 bytes.

The RLE code has 92 values, which means the compressed file will be 92 bytes in size.
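A minimal sketch of run length encoding on a string of symbols, where each run is stored as a (count, symbol) pair. It is an illustration of the general idea rather than the exact scheme used for the image above; the function names and the sample row are made up.

```python
def rle_encode(data: str) -> list[tuple[int, str]]:
    """Collapse each run of repeated symbols into a (count, symbol) pair."""
    runs: list[tuple[int, str]] = []
    for symbol in data:
        if runs and runs[-1][1] == symbol:
            runs[-1] = (runs[-1][0] + 1, symbol)  # extend the current run
        else:
            runs.append((1, symbol))              # start a new run
    return runs


def rle_decode(runs: list[tuple[int, str]]) -> str:
    """Rebuild the original data exactly from the (count, symbol) pairs."""
    return "".join(symbol * count for count, symbol in runs)


row = "WWWWBBBWW"  # e.g. one row of white (W) and black (B) pixels
encoded = rle_encode(row)
print(encoded)
print(rle_decode(encoded) == row)
```

RLE only saves space when the data contains long runs; data with few repeats can end up larger after encoding.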

example::

WHEN IT IS SNOWING HEAVILY LOOK OUTSIDE. LOOK OUTSIDE IT IS SNOWING HEAVILY.

Image Format

File formats

A file format is the method that we choose to store different data on a computer.