Tuesday 23 August 2016

A few comments on hash-codes

Apropos of my comments on databases, there is something that any extensive database must have, and that is a "guaranteed" unique key associated with the root entry for any set of related data - a unique account number, if you like.

There are, of course, ways and means to achieve something that approximates this and most of those means use things like MD5 hashes or Global Unique Identifier Numbers.

The latter is a 128-bit number expressed as 32 hexadecimal digits. They are generated with a small but finite chance of collision, but for most purposes, they are effectively unique.

But, the question, is how does one go about generating these numbers?

The answer is simply - complicated mathematical functions.

Now, for the purpose of having a unique identifier that is automatically generated and that will be associated with the root entry for each accession in our Database of Curiosities, and which will remain, no matter what else we change at a later date, it is possible to come up with a relatively simple way of hashing some kind of data (called a message digest).

In this case, the entire primary entry is combined with a date and time string and the (supposedly unique) accession number, and then mangled in order to produce a 128-bit code.

It doesn't matter that the code is meaningless to your eyes, it is a reference number that should be effectively unique in the world. If you then prefix that code with a hexadecimal representation of the accession number, then you have an excellent chance that there will be no collisions anywhere or at any time.

Since we are not using this method to generate data integrity checksums, nor are we using it to encrypt data, then there is no need for the complexity of the MD5 or GUID generation code.

I constructed this function in Excel for the purpose of testing, and so here is the code:

(apologies for the long lines)

' Simple hashing function to generate a unique record identifier which will not collide with other
' record identifiers.

Public Function HashRecord(strIn As String, strTime As String, strID As String) As String
    ' strIn -       The data string to be hashed.
    ' strTime -     A string representation of the date and time of the function call.
    ' strID -       A string representation of the accession number
   
    Dim numHash(18) As Integer
   
    strFeed = strTime & strID & strIn & strID & strTime
   
    For i = 0 To 18         ' explicitly zero the working array.
        numHash(i) = 0
    Next i
   
    byCarry = 0
   
    For i = Len(strFeed) To 1 Step -1
        strNext = Mid(strFeed, i, 1) ' extract the next character
        chrVal = Asc(strNext)        ' convert the character into an 8-bit ASCII value
       
        numHash(1) = ((numHash(1) * 8) + byCarry) Xor chrVal
            ' shift the binary bits of the number up 3 bits, add whatever
            ' carry value came from the last calculation and XOR with the
            ' next byte of the string to be hashed.
           
        byCarry = 0 ' the carry value
       
        If numHash(1) > 255 Then    ' calculate the carry, and the value of the first byte
            byCarry = Int(numHash(1) / 256)
            numHash(1) = numHash(1) And 255
        End If
       
        For j = 2 To 16             ' ripple the change up the entire array
            numHash(j) = (numHash(j) * 8) + byCarry
                byCarry = 0
                If numHash(j) > 255 Then
                    byCarry = Int(numHash(j) / 256)
                    numHash(j) = numHash(j) And 255
                End If
        Next j

    Next i

     If byCarry Then    ' if there is still a carry, then wrap it around and add it into the first byte.
         For j = 1 To 16
            numHash(j) = numHash(j) + byCarry
                byCarry = 0
                If numHash(j) > 255 Then
                    byCarry = Int(numHash(j) / 256)
                    numHash(j) = numHash1 And 255
                End If
        Next j
    End If
   
    hexHash = ""
   
    For j = 1 To 16 ' convert to hexadecimal (with leading zeros)
        hexHash = hexHash & Right("00" & Hex(numHash(j)), 2)
    Next j
    hexHash = Left(LCase(hexHash) & "0000000000000000000000000000000000000000000000000000", 32) 

' fix the length, and make the letters lower case
    hexHashX = Left(hexHash, 8) & "-" & Mid(hexHash, 9, 8) & ":" & Mid(hexHash, 17, 8) & "-" & Mid(hexHash, 26, 99) ' break up the string of digits
   
    hexHash = ""
    For i = 1 To Len(strID) ' convert the accession number to hexadecimal
        hexHash = hexHash & Right("00" & Hex(Asc(Mid(strID, i, 1))), 2)
    Next i
   
    hexHash = Right("0000000000000000" & hexHash, 16) ' just 8 characters to be used

    HashRecord = Left(hexHash, 8) & "-" & Right(hexHash, 8) & ":" & hexHashX
   
End Function




.
The output looks something like this:


00004D30-30303030:66b531da-92377f48:7d279dcd-6e7d107
The first two groups are the accession number (M00000), the remaining four groups are the digest.


Of  course, it is quite possible that I will simply end up using something much simpler.

No comments:

Post a Comment

Comments and suggestions are warmly welcomed.