Tuesday 23 August 2016

A few comments on hash-codes

Apropos of my comments on databases, there is something that any extensive database must have, and that is a "guaranteed" unique key associated with the root entry for any set of related data - a unique account number, if you like.

There are, of course, ways and means to achieve something that approximates this, and most of them use things like MD5 hashes or Globally Unique Identifiers (GUIDs).

The latter is a 128-bit number, usually written as 32 hexadecimal digits. GUIDs are generated with a small but finite chance of collision, but for most purposes they are effectively unique.
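
To put a rough number on that "small but finite" chance, the usual birthday-bound estimate can be sketched in a few lines of VBA. This assumes identifiers drawn uniformly at random from all 2^bits possible values, which is an idealisation (real GUID schemes reserve a few bits for version information, so the true figure is a little higher). The function name CollisionChance is made up for this illustration.

```vba
' Birthday-bound estimate (an approximation, valid while the result is small):
' the chance of at least one collision among n identifiers drawn uniformly
' at random from 2 ^ bits possible values is roughly n(n-1)/2 divided by 2 ^ bits.
Public Function CollisionChance(ByVal n As Double, ByVal bits As Integer) As Double
    CollisionChance = (n * (n - 1) / 2) / (2 ^ bits)
End Function
```

Typing `? CollisionChance(1000000, 128)` in the Immediate window gives a value of roughly 1.5E-27 for a million identifiers, which is why such numbers can be treated as unique in practice.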

But the question is: how does one go about generating these numbers?

The answer, put simply: complicated mathematical functions.

Now, we want a unique identifier that is generated automatically, is associated with the root entry for each accession in our Database of Curiosities, and remains fixed no matter what else we change at a later date. For that purpose, it is possible to come up with a relatively simple way of hashing some kind of data to produce what is called a message digest.

In this case, the entire primary entry is combined with a date and time string and the (supposedly unique) accession number, and then mangled in order to produce a 128-bit code.

It doesn't matter that the code is meaningless to your eyes; it is a reference number that should be effectively unique in the world. If you then prefix that code with a hexadecimal representation of the accession number, you have an excellent chance that there will be no collisions anywhere or at any time.

Since we are not using this method to generate data integrity checksums, nor to encrypt data, there is no need for the complexity of the MD5 or GUID generation code.

I constructed this function in Excel for the purpose of testing, and so here is the code:

(apologies for the long lines)

' Simple hashing function to generate a unique record identifier which will not collide with other
' record identifiers.

Public Function HashRecord(strIn As String, strTime As String, strID As String) As String
    ' strIn -       The data string to be hashed.
    ' strTime -     A string representation of the date and time of the function call.
    ' strID -       A string representation of the accession number
   
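    Dim numHash(18) As Integer
    ' Declare the working variables so the function also compiles under Option Explicit.
    Dim strFeed As String, strNext As String, hexHash As String, hexHashX As String
    Dim i As Long, j As Long, chrVal As Integer, byCarry As Long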
   
    strFeed = strTime & strID & strIn & strID & strTime
   
    For i = 0 To 18         ' explicitly zero the working array.
        numHash(i) = 0
    Next i
   
    byCarry = 0
   
    For i = Len(strFeed) To 1 Step -1
        strNext = Mid(strFeed, i, 1) ' extract the next character
        chrVal = Asc(strNext)        ' convert the character into an 8-bit ASCII value
       
        numHash(1) = ((numHash(1) * 8) + byCarry) Xor chrVal
            ' shift the binary bits of the number up 3 bits, add whatever
            ' carry value came from the last calculation and XOR with the
            ' next byte of the string to be hashed.
           
        byCarry = 0 ' the carry value
       
        If numHash(1) > 255 Then    ' calculate the carry, and the value of the first byte
            byCarry = Int(numHash(1) / 256)
            numHash(1) = numHash(1) And 255
        End If
       
        For j = 2 To 16             ' ripple the change up the entire array
            numHash(j) = (numHash(j) * 8) + byCarry
            byCarry = 0
            If numHash(j) > 255 Then
                byCarry = Int(numHash(j) / 256)
                numHash(j) = numHash(j) And 255
            End If
        Next j

    Next i

    If byCarry Then    ' if there is still a carry, then wrap it around and add it into the first byte.
        For j = 1 To 16
            numHash(j) = numHash(j) + byCarry
            byCarry = 0
            If numHash(j) > 255 Then
                byCarry = Int(numHash(j) / 256)
                numHash(j) = numHash(j) And 255   ' was "numHash1", an undefined variable that zeroed the byte
            End If
        Next j
    End If
   
    hexHash = ""
   
    For j = 1 To 16 ' convert to hexadecimal (with leading zeros)
        hexHash = hexHash & Right("00" & Hex(numHash(j)), 2)
    Next j
    hexHash = Left(LCase(hexHash) & "0000000000000000000000000000000000000000000000000000", 32) ' fix the length, and make the letters lower case

    hexHashX = Left(hexHash, 8) & "-" & Mid(hexHash, 9, 8) & ":" & Mid(hexHash, 17, 8) & "-" & Mid(hexHash, 25, 8) ' break up the string of digits (the last group starts at digit 25; starting at 26 would drop a digit)
   
    hexHash = ""
    For i = 1 To Len(strID) ' convert the accession number to hexadecimal
        hexHash = hexHash & Right("00" & Hex(Asc(Mid(strID, i, 1))), 2)
    Next i
   
    hexHash = Right("0000000000000000" & hexHash, 16) ' just 16 hex digits (the last 8 characters of the accession number) to be used

    HashRecord = Left(hexHash, 8) & "-" & Right(hexHash, 8) & ":" & hexHashX
   
End Function




The output looks something like this:


00004D30-30303030:66b531da-92377f48:7d279dcd-6e7d107
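The first two groups are the accession number (M00000) in hexadecimal, padded with leading zeros; the remaining four groups are the digest. The final group in this sample has only seven digits because the last Mid call that produced it started at position 26 rather than 25, dropping one digit.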
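
The accession-number part of the output can be checked on its own. The sketch below repeats only those steps of HashRecord in a separate helper; the name AccessionPrefix is hypothetical and is not part of the original listing.

```vba
' Hypothetical helper, not part of the original listing: it repeats only the
' accession-number steps of HashRecord so they can be checked in isolation.
Public Function AccessionPrefix(ByVal strID As String) As String
    Dim hexID As String
    Dim i As Long

    For i = 1 To Len(strID)     ' two hexadecimal digits per character
        hexID = hexID & Right("00" & Hex(Asc(Mid(strID, i, 1))), 2)
    Next i

    hexID = Right("0000000000000000" & hexID, 16)   ' pad (or trim) to 16 digits

    AccessionPrefix = Left(hexID, 8) & "-" & Right(hexID, 8)
End Function
```

For "M00000" the characters are 4D 30 30 30 30 30, so AccessionPrefix("M00000") returns 00004D30-30303030, matching the sample above. Note that an accession number longer than eight characters would keep only its last eight.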


Of course, it is quite possible that I will simply end up using something much simpler.

Monday 22 August 2016

The Curious Curator

Introduction: A little history

I have, in the past, made oblique reference to my main hobby which is, contrary to most beliefs, not based in computing or electronics.

I have to confess, here and now, that my passion is for minerals and all things mineralogical.

I live close to an area which has hosted all manner of metalliferous (and non-metalliferous) mining for centuries - an area with a rich industrial history, and even after a century of dereliction, a host of minerals to find and to collect.

For anyone who collects anything at all with more than a casual seriousness, an important part of that collection is the keeping of records.

Going from a simple stock-book to a simple electronic card-index took time (and lots of cramped fingers). Advancing from a card-index program to a database took even more time. The database I chose was what was available to me - Microsoft Access, as a part of the Office 97 suite.

That edition of Access is getting somewhat long in the tooth now, and is not completely happy running under Windows 7. I cannot, though, justify spending money on a more recent copy of Access; I can easily justify spending time and effort on migrating to a new database.

Over the past four years, I have played with ideas, fiddled with software and tried stuff out, with the following results:

  • The available Database Management Systems are many and varied - none comes even close to the visual form/report design capabilities of Access (forms that require zero programming!)
  • The available database systems, while excellent, are designed for vast amounts of data presented in a strict and inflexible format. Most of them are Relational Database systems.
  • The available database systems store data in table files that are not, by any stretch of the imagination, human readable or even human friendly.
  • Losing a database file that is not properly backed up means losing the data within the file - in its entirety.
  • Losing the software in an upgrade, likewise, means losing access to your data.
  • Most database systems can reference external files (documents), and some may even incorporate those documents within the database itself - in their own, human-incompatible format.

My answer to this is a novel database that uses the file-structure of the storage medium (disc drive) to organise text files that contain various indices and data sets (as documents), much as physical documents would be stored in a museum's collection records.

Thus we have the first inkling of the Curious Curator for a Cabinet of Curiosities.

The logo is a crown, in fact one drawn well over a century ago by John Tenniel as a part of an illustration in a children's story book.

Why a crown, and that one in particular? A favourite quotation of the young girl who wore it.

"Curiouser and curiouser," said Alice.

The central tenet of the program is that each and every document is stored in such a way that any system accessing the data can retrieve it without needing any particular piece of software to extract the information from the file.

Other than media files (images, video etc.), everything is stored as plain text.

Whilst this is not the most efficient manner of storing data, either space-wise or for access speed, it is robust in the extreme. Damage to a single file loses one record-card's worth of data (and that may be recoverable in seconds from a master table, which is not itself easily accessible to the human reader).

Documents, file cards, image galleries and so on could be constructed using nothing more than simple tools and text files.
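
As a rough illustration of the idea, here is a minimal sketch of writing one record card as a plain text file in its own folder. Everything specific in it is an assumption made for the example: the folder-per-accession layout, the file name card.txt, the field names and the routine name WriteRecordCard. It is written in VBA only for the sake of a concrete example; the language chosen for the actual project is not named here.

```vba
' Hypothetical sketch: one folder per accession number, holding a plain text
' "card" with one "Field: value" pair per line. The layout and field names
' are illustrative assumptions, not the project's actual format.
Public Sub WriteRecordCard(ByVal strRoot As String, ByVal strID As String, _
                           ByVal strName As String, ByVal strLocality As String)
    Dim strDir As String
    Dim intFile As Integer

    strDir = strRoot & "\" & strID

    ' Create the accession folder if needed (assumes strRoot already exists).
    If Dir(strDir, vbDirectory) = "" Then MkDir strDir

    intFile = FreeFile
    Open strDir & "\card.txt" For Output As #intFile
    Print #intFile, "Accession: " & strID
    Print #intFile, "Name: " & strName
    Print #intFile, "Locality: " & strLocality
    Close #intFile
End Sub
```

The resulting file can be opened in any text editor, and the folder name alone is enough to find the card for a given accession number without any database software.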

While the software is still in the planning stage, I have broken ground on it - having decided on the programming language, platform and delivery medium. I have also managed to get together in my mind the various tools and techniques that will be brought to bear.

At some point, the project will be available for download, comment, testing and piracy from a project page on SourceForge.

And, as with most things I do, the tools and software I will be using are all open source, as will be the Curious Curator Database.

https://sourceforge.net/projects/curious-curator/

Just because I happen to like the metaphor of filing cabinets, card indexes and manila folders in an old museum office, here is a quick and dirty, preliminary mock-up of one of the pages I have planned ...


And just in case you don't understand the reference to the Cabinet of Curiosities - that is the early name given to private collections of objects (curiosities or curios). These 'cabinets' (often suites of rooms given over to them) eventually became the museums we know today.