
.big file format described in detail (analysis complete)

  1. #1
    ThomasT
    Guest

    .big file format described in detail (analysis complete)

    To all the Homeworld 2 community,

    Administrator Moe has cleared my posting of this (and similar information) on the forum. What follows is a detailed description of the .big file format. All fields have been identified and analyzed, so the description should be sufficient to create archives.

    Code:
    All indexes, offsets, and counts are little-endian and require
    conversion for the Mac PowerPC architecture, but are ok as-is on the
    Intel platform.  Items that are numeric and described as 4 bytes are of 
    type uint32_t.  Items that are numeric and described as 2 bytes are of
    type uint16_t.
    
    Overall format is:
        Archive Header
        Section Header describing the four sections immediately
            following the Archive Header (TOC List,
            Folder List, File Info List, and File Name List)
        TOC (Table of Contents) List
        Folder List
        File Info List
        File Name List
        File Data for all the files (including the 264 byte header
           preceding the file data of each file)
    
    The format of each of the above is:
        
    180 byte archive header
    	8 bytes of "_ARCHIVE"
    	4 bytes version
    	16 bytes for MD5 tool signature of archive (MD5 of tool security
                 key and full file data excluding the archive header)
    	128 bytes for 64 utf16 chars for archive name
    	16 bytes for MD5 signature of archive (MD5 of HW2 Root Security Key
                 and archive header data)
    	4 bytes section header size
    	4 bytes exact file data offset
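
A minimal sketch of reading the archive header laid out above, using Python's `struct` module. The function and dictionary key names are invented for illustration, and treating the archive name as little-endian UTF-16 padded with NULs is an assumption, not something the analysis confirms.

```python
import struct

# Field layout of the 180-byte archive header ("<" = little-endian, no padding).
ARCHIVE_HEADER = struct.Struct("<8sI16s128s16sII")

def parse_archive_header(buf):
    """Unpack the archive header found at the start of buf."""
    (magic, version, tool_md5, raw_name,
     header_md5, section_header_size, data_offset) = ARCHIVE_HEADER.unpack_from(buf, 0)
    if magic != b"_ARCHIVE":
        raise ValueError("missing _ARCHIVE identifier")
    # Assumption: 64 UTF-16 little-endian code units, NUL-padded.
    name = raw_name.decode("utf-16-le").split("\x00", 1)[0]
    return {
        "version": version,
        "tool_md5": tool_md5,
        "name": name,
        "header_md5": header_md5,
        "section_header_size": section_header_size,
        "data_offset": data_offset,
    }
```

`ARCHIVE_HEADER.size` works out to 180, which matches the total given for the header.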
    
    24 byte section header consisting of four 6 byte sections.
    Each 6 byte section has:
    	4 byte offset relative to archive header
    	2 byte count
    
    The four sections are:
    	TOC List (describes each TOC entry, that is, each folder hierarchy)
    	Folder List (describes the folder hierarchy for each TOC)
    	File Info List (describes each file)
    	File Name List (the list of file names, including folder names)
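
As a rough illustration, the four offset/count pairs of the section header could be read like this. The names `Section`, `SECTION_NAMES`, and `parse_section_header` are hypothetical, and the sketch returns the offsets exactly as stored without converting them to absolute file positions.

```python
import struct
from collections import namedtuple

Section = namedtuple("Section", "offset count")

# Four (uint32 offset, uint16 count) pairs, little-endian, 24 bytes total.
SECTION_HEADER = struct.Struct("<IHIHIHIH")
SECTION_NAMES = ("toc", "folders", "file_info", "file_names")

def parse_section_header(buf, pos=180):
    """Read the section header that follows the 180-byte archive header."""
    values = SECTION_HEADER.unpack_from(buf, pos)
    pairs = zip(values[0::2], values[1::2])
    return {name: Section(off, cnt) for name, (off, cnt) in zip(SECTION_NAMES, pairs)}
```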
    
    TOC list entry (138 bytes)
    	64 character alias name
    	64 character name
    	2 byte first folder index
    	2 byte last folder index
    	2 byte first filename index
    	2 byte last filename index
    	2 byte start folder index for hierarchy
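
A sketch of unpacking one 138-byte TOC list entry. It assumes the two 64 character names are single-byte, NUL-padded strings (the sizes only add up to 138 if each character is one byte); the helper and key names are made up for this example.

```python
import struct

# 64-byte alias, 64-byte name, then five uint16 indexes: 138 bytes.
TOC_ENTRY = struct.Struct("<64s64sHHHHH")

def _text(raw):
    # Assumption: names are NUL-padded, one byte per character.
    return raw.split(b"\x00", 1)[0].decode("latin-1")

def parse_toc_entry(buf, pos=0):
    """Unpack a single TOC list entry starting at pos."""
    (alias, name, first_folder, last_folder,
     first_file, last_file, start_folder) = TOC_ENTRY.unpack_from(buf, pos)
    return {
        "alias": _text(alias),
        "name": _text(name),
        "first_folder": first_folder,
        "last_folder": last_folder,
        "first_file": first_file,
        "last_file": last_file,
        "start_folder": start_folder,
    }
```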
    
    Folder list entry (12 bytes)
    	4 bytes file name offset (relative to file name list offset)
    	2 bytes first subfolder index
    	2 bytes last subfolder index
    	2 bytes first filename index
    	2 bytes last filename index
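
A sketch of reading a folder list entry and resolving its name from the file name list. It assumes the names in the file name list are NUL-terminated single-byte strings, which the description above does not state; `names_blob` stands for the raw bytes of the file name list section.

```python
import struct

# uint32 name offset followed by four uint16 indexes: 12 bytes.
FOLDER_ENTRY = struct.Struct("<IHHHH")

def read_name(names_blob, offset):
    """Return the string starting at offset, up to the next NUL (assumed terminator)."""
    end = names_blob.index(b"\x00", offset)
    return names_blob[offset:end].decode("latin-1")

def parse_folder_entry(buf, pos, names_blob):
    """Unpack one folder list entry and look up its name."""
    (name_offset, first_sub, last_sub,
     first_file, last_file) = FOLDER_ENTRY.unpack_from(buf, pos)
    return {
        "name": read_name(names_blob, name_offset),
        "first_subfolder": first_sub,
        "last_subfolder": last_sub,
        "first_file": first_file,
        "last_file": last_file,
    }
```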
    
    File info list entry (17 bytes)
    	4 bytes file name offset (relative to file name list offset)
    	1 byte flags (0x00 if uncompressed
                          0x10 to decompress during read -- used for large files
                          0x20 to decompress all at once -- used for small files, like .lua files)
    	4 bytes file data offset (relative to overall file data offset)
    	4 bytes compressed length
    	4 bytes decompressed length
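
A sketch of unpacking one 17-byte file info entry and naming its flag value. The labels in `FLAG_NAMES` are just descriptive strings chosen for this example; only the numeric values come from the description above.

```python
import struct

# uint32 name offset, uint8 flags, then three uint32 values: 17 bytes.
FILE_INFO_ENTRY = struct.Struct("<IBIII")

FLAG_NAMES = {
    0x00: "uncompressed",
    0x10: "decompress during read",
    0x20: "decompress all at once",
}

def parse_file_info_entry(buf, pos=0):
    """Unpack a single file info list entry starting at pos."""
    (name_offset, flags, data_offset,
     compressed_len, decompressed_len) = FILE_INFO_ENTRY.unpack_from(buf, pos)
    return {
        "name_offset": name_offset,
        "flags": flags,
        "flag_name": FLAG_NAMES.get(flags, "unknown"),
        "data_offset": data_offset,
        "compressed_len": compressed_len,
        "decompressed_len": decompressed_len,
    }
```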
    
    File header preceding file data for each file (264 bytes)
    	256 chars for file name
    	4 bytes file modification date
    	4 bytes CRC of uncompressed file data
    
    Note that the file data offset in the file info list entry indicates the
    location of the file data.  In order to access the file header
    preceding the file data you must subtract 264 from the offset.
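
The offset arithmetic in the note above can be sketched as follows. `data_start` stands for the "exact file data offset" from the archive header and `file_offset` for the per-file offset from the file info entry; both parameter names, and the NUL-padded single-byte file name, are assumptions made for this example.

```python
import struct

# 256-byte name, uint32 modification date, uint32 CRC: 264 bytes.
FILE_HEADER = struct.Struct("<256sII")

def read_file_header(buf, data_start, file_offset):
    """Locate and unpack the header that sits just before a file's data."""
    data_pos = data_start + file_offset        # where the file data begins
    header_pos = data_pos - FILE_HEADER.size   # 264 bytes earlier
    raw_name, mtime, crc = FILE_HEADER.unpack_from(buf, header_pos)
    name = raw_name.split(b"\x00", 1)[0].decode("latin-1")
    return {"name": name, "mtime": mtime, "crc": crc, "data_pos": data_pos}
```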
    
    The HW2 Root Security Key is an ASCII string that is passed first to
    the MD5 algorithm followed by the archive header data to create the
    archive's 128 bit (16 byte) MD5 signature.  The MD5 algorithm used is
    standard.  The Root Security Key is embedded in the HW2 application
    and also in Relic's archive tool.
    
    The tool security key is an ASCII string that is passed first to the MD5
    algorithm followed by the full data in the archive excluding the archive
    header to create the archive's 128 bit (16 byte) MD5 tool signature.
    The MD5 algorithm is standard.  The tool security key is embedded in
    Relic's archive tool.
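
Both signatures follow the same pattern: feed the key to MD5 first, then the data. A minimal sketch with `hashlib`, where `security_key` is a placeholder for whichever key applies (no key is given in this thread) and `payload` is the byte range being signed:

```python
import hashlib

def keyed_md5(security_key, payload):
    """Return the 16-byte MD5 digest of security_key followed by payload."""
    md5 = hashlib.md5()
    md5.update(security_key)   # the ASCII key goes in first
    md5.update(payload)        # then the data being signed
    return md5.digest()
```

Because MD5 processes its input as one stream, this gives the same result as hashing the concatenation of key and payload in a single call.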
    
    The file modification date appears to be the number of seconds
    since UTC 00:00:00 January 1st, 1970.  This date is the Unix epoch,
    although it is unknown to the author of this document if that is also
    the Windows epoch.
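
If the field really is seconds since the Unix epoch, as it appears to be, converting it is straightforward. A small sketch (the function names are made up for illustration):

```python
from datetime import datetime, timezone

def decode_mtime(seconds):
    """Interpret the 4-byte modification date as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def encode_mtime(moment):
    """Convert a timezone-aware datetime back to the 32-bit stored value."""
    return int(moment.timestamp()) & 0xFFFFFFFF
```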
    
    The CRC algorithm used to calculate the uncompressed file data CRC
    is the exact same algorithm used for Homeworld.  Apparently the algorithm
    and table are taken from the 32-Bit CRC International Standard,
    which is based on a particular mathematical formula.  Thus there shouldn't
    be any concerns over copyright in this case.
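
A sketch of checking a file's CRC, assuming the "32-Bit CRC International Standard" mentioned above is the common CRC-32 that Python exposes as `zlib.crc32`. That equivalence is an assumption and should be checked against a real archive before relying on it.

```python
import zlib

def file_crc(uncompressed_data):
    """CRC-32 of the uncompressed file data as an unsigned 32-bit value."""
    return zlib.crc32(uncompressed_data) & 0xFFFFFFFF

def crc_matches(uncompressed_data, stored_crc):
    """Compare the computed CRC with the value stored in the file header."""
    return file_crc(uncompressed_data) == stored_crc
```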
    Last edited by ThomasT; 16th May 06 at 8:42 AM. Reason: Added link to BIGaddendum.zip

  2. #2
    My code isn't as clean as yours, but I'll post any changes tomorrow when I've had some sleep.

  3. #3
    ThomasT
    Guest

    Continuing with analysis

    Delphy, you haven't responded in a few days, so I've continued my analysis. My suspicion is that the original unknowns in my file spec have never been figured out by anyone, because I found more information on the internet showing that others have also marked the same fields as unknown. Two links are http://www.relic.com/rdn/wiki/DOWFileFormats/SGA and http://www.watto.org/extract/program/archives.txt.

    Both of the above URLs describe the .sga file format used by Dawn of War. That format is almost identical to the Homeworld 2 .big file format, and both pages mark the same items as unknown as I did. Furthermore, I found a thread somewhere in the Relic forums (sorry, I lost the ref) discussing the .sga format. The thread mentioned Spooky's description of the format, and in fact the Relic page referenced above apparently contains Spooky's understanding of the format. Assuming the spec was his most up-to-date one (and no one has posted anything more recent), it is fair to assume that even Spooky didn't understand the missing information (or was unwilling to share it).

    Suffice it to say, I am continuing with my analysis.
    Last edited by ThomasT; 5th Mar 06 at 1:08 PM.

  4. #4
    The 22nd Hyperspace Core Corsix's Avatar
    Join Date
    Sep 2004
    Location
    Oxford
    The formats on the SGA page are my VB code, which I have recently ported to C++. This is Spooky's VB code (hopefully he won't mind me posting it):
    Code:
    Public Type tagArchiveHeader
       sIdentifier As String * 8
       lVersion As Long
       dUnknown(0 To 3, 0 To 3) As Byte
       sArchiveType As String * 128
       dCRC(0 To 3, 0 To 3) As Byte
       lDataHeaderSize As Long
       lDataOffset As Long
    End Type
    
    Public Type tagRecordHeader
       lTOCOffset As Long
       nNumTOC As Integer
       lDirectoyOffset As Long
       nNumDirectories As Integer
       lFileOffset As Long
       nNumFiles As Integer
       lItemsOffset As Long
       nNumItems As Integer
    End Type
    
    Public Type tagTOCHeader
       sTOCAlias As String
       sTOCStartName As String
       nTOCStartDir As Integer
       nTOCEndDir As Integer
       nTOCStartFile As Integer
       nTOCEndFile As Integer
       nTOCFolderOffset As Integer
    End Type
    
    Public Type tagDirectoryOffsets
       lSubFolderOffset As Long
       nFirstSubfolder As Integer
       nNextSubFolder As Integer
       nFirstFile As Integer
       nNextFile As Integer
       sDirectoryName As String
    End Type
    
    Public Type tagFileOffsets
       lOffset As Long
       dICFiller As Byte
       lDoWFiller As Long
       lPosition As Long
       lCompressedSize As Long
       lOriginalSize As Long
       sFileName As String
    End Type
    
    Public Type tagArchive
       ArchiveHeader As tagArchiveHeader
       RecordHeader As tagRecordHeader
       TOCHeader() As tagTOCHeader
       DirectoryOffsets() As tagDirectoryOffsets
       FileOffsets() As tagFileOffsets
       DirectoryNames() As String
       FileNames() As String
    End Type
    He thinks it's a 16 byte CRC in the header, but there are functions in DoW's filesystem.dll that relate to SGAs and MD5s, so I suspected it was some kind of MD5. As for the unknowns, I don't know what they are, and if Spooky did then he didn't tell me when I asked.

    I would like to know what the unknowns are though so I can craft SGA archives.

  5. #5
    I have it all as unknowns in my code, and yes, the DoW/IC SGA format is very similar to the HW2 ".big" format - it's basically the same with a couple of changes.

  6. #6
    ThomasT
    Guest
    Corsix and Delphy, thanks for your responses. You've confirmed what I suspected. I will continue analyzing things and keep you all informed. Again, thanks for your support!

    Update: I've updated the compression flags to include the 0x10 value, and to clarify what the 0x20 value means. Also, I have figured out the CRC field in the file header. Lastly, I've figured out what the other 16 bytes in the archive header mean.

    The last little bit I'm struggling to understand is the meaning of the 4 bytes just before the CRC. The pattern of these 4 bytes is odd, more akin to flags or some kind of index. Hopefully I'll have this figured out without too much trouble, because then the entire format will have been analyzed.
    Last edited by ThomasT; 8th Mar 06 at 10:58 PM.

  7. #7
    The 22nd Hyperspace Core Corsix's Avatar
    Join Date
    Sep 2004
    Location
    Oxford
    As far as "big" and "small" files, Relic seem to draw the line at 4k:
    The biggest file I found with the 0x20 flag was 4,080 bytes (16 bytes under 4k)
    The smallest file I found with the 0x10 flag was 4,106 bytes (10 bytes over 4k)
    As for the 0x00 flag, it doesn't "get toggled" at a specific value, so I assume it's used when zlib cannot do any compressing
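
The observations above suggest a simple rule for picking a flag when writing an archive. This is only a sketch of that guess: the 4096 byte cut-off, the use of a plain `zlib.compress` stream, and the constant names are all assumptions rather than confirmed behaviour of Relic's tool.

```python
import zlib

FLAG_STORED = 0x00   # uncompressed
FLAG_STREAM = 0x10   # decompress during read (large files)
FLAG_BUFFER = 0x20   # decompress all at once (small files)

def choose_flag(data, threshold=4096):
    """Return (flag, bytes_to_store) for one file's data."""
    compressed = zlib.compress(data)
    if len(compressed) >= len(data):
        # Compression did not help, so store the data as-is.
        return FLAG_STORED, data
    if len(data) < threshold:
        return FLAG_BUFFER, compressed
    return FLAG_STREAM, compressed
```

Whether a file of exactly 4096 bytes counts as small or large cannot be told from the two sizes quoted, so the comparison used here is a guess.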

  8. #8
    ThomasT
    Guest

    Analysis Complete!

    I don't quite know how to say this, except to just say it. The analysis of the .big file format is now complete. Thank God (truly)! I think I received some divine help to complete this, sigh.

    The final part I didn't understand, namely the 4 bytes just after the file header's 256 byte array for the file name, represents the modification date for the file. I believe it is the number of seconds since UTC 00:00:00 January 1st, 1970. The numbers look about right, and also that is the epoch for Unix systems. (I'm not sure if it is the same for Windows systems.)

    I kept staring at those numbers, trying to figure out the relationship between them and the files. Then it suddenly dawned on me that the numbers were very close together, with very little variance. I also remembered reading somewhere that the original Homeworld had a modification date embedded in the format.

    Although the .big file format has changed from HW to HW2 to DoW, there has been a pattern of evolution evident in the format changes. So, on a hunch I got on a Windows system and used ModPackager to build a test archive twice, updating the modification time of one file the second time (and refreshing files in ModPackager). I then copied the two archives to a Unix system and did an od (octal dump) of the archives, and voila! The bytes I didn't understand in the file header were the ones that were different! I then wrote a quick C program to dump the current date out in seconds since the epoch, and the changed bytes in the archive were very very close to that value.
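
The comparison described above (build the archive twice with one file's modification time changed, then see which bytes differ) can be sketched in a few lines. The function name is made up for illustration, and it assumes the two archives are the same length.

```python
def differing_ranges(a, b):
    """Return half-open (start, end) ranges where two equal-length buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers must be the same length")
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and start is None:
            start = i                      # a run of differences begins
        elif x == y and start is not None:
            ranges.append((start, i))      # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, len(a)))     # run extends to the end
    return ranges
```

Each reported range can then be read as a little-endian integer and compared with the current time in seconds since the epoch.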

    Ok. Now I need to take a little break while I plan the creation of the Mac tool for extracting/creating archives.
    :snore:
    Last edited by ThomasT; 24th May 06 at 8:56 AM. Reason: Added bigar thread reference

  9. #9
    The 22nd Hyperspace Core Corsix's Avatar
    Join Date
    Sep 2004
    Location
    Oxford
    I also understand the DoW SGA format in its entirety, so I'll update the DoW wiki and then integrate SGA creation into my current project.

    PS. Shall we post the security keys?

  10. #10
    ThomasT
    Guest
    Corsix, let's hold off posting the security keys for the moment. I still feel uncomfortable publishing that kind of information. Also, anyone who knows their way around a debugger or can get the strings out of an executable will find the keys easily.

  11. #11
    Lost in the code... Mikali's Avatar
    Join Date
    Jun 2003
    Location
    %HW2_ROOT%
    This thread should be moved to Archive Dump and stickied.

  12. #12
    ThomasT
    Guest

    BIG Archive Format Miscellaneous Notes

    What follows are some miscellaneous notes I've accumulated while building the MacOS X archive tool.

    The Relic archive tool does not lay out file data in canonical order during archive updates. Specifically, this makes it virtually impossible to extract/rearchive the stock Homeworld 2 hw2data archive and have the result be byte-for-byte identical. This is because the stock hw2data archive appears to have been updated multiple times during construction. This does not cause a problem with game execution, but it does make it impossible to absolutely verify that an extract/rearchive is completely correct.

    The BIG file format has several fixed character arrays. Some investigation is necessary to see if these various arrays are null-terminated (ending with '\0') or not. The specific fields are the archive name (64 UTF-16 characters) in the archive header, the 256 character file name array in the file header preceding each file data block, and the two 64 character arrays for the name and alias in each TOC list entry. Understanding these limits is important when building 3rd party archive tools. For now it is safest to assume that they all are null-terminated. Note that these kinds of limits are the typical breeding ground for buffer overflow problems.
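
A cautious way to handle these fixed arrays in a third party tool is to always leave room for a terminator when writing and to stop at the first NUL when reading. A sketch for the single-byte fields, with invented helper names and an assumed Latin-1 encoding:

```python
def encode_fixed(text, size):
    """Encode text into a NUL-padded field of exactly size bytes."""
    raw = text.encode("latin-1")
    if len(raw) >= size:
        # Refuse names that would leave no room for the terminating NUL.
        raise ValueError("name too long for a %d byte field" % size)
    return raw.ljust(size, b"\x00")

def decode_fixed(raw):
    """Decode a fixed-size field, stopping at the first NUL if there is one."""
    return raw.split(b"\x00", 1)[0].decode("latin-1")
```

Rejecting over-long names up front avoids the buffer overflow risk mentioned above, at the cost of one usable character per field.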

    The 256 character file name array in the file header appears to be completely useless and a large waste of space, in case anyone was wondering. The format creators went to great lengths to reduce the amount of space used by the various file and folder pathnames, so it is odd that this per file 256 character file name array exists at all since it wastes so much space and does not affect anything else in the archive. Perhaps it is used algorithmically by the HW2 engine in some way that an archive program does not. If anyone has any ideas on this, let me know.

    [Mikali: Who would move this thread to Archive Dump? Would it be me, or someone else? If it's me then someone will need to explain to me how to do that. Also, if a thread is moved, does it interfere with any links from other threads to the moved thread?]
    Last edited by ThomasT; 27th Apr 06 at 8:59 AM.

  13. #13
    Lost in the code... Mikali's Avatar
    Join Date
    Jun 2003
    Location
    %HW2_ROOT%
    One of the moderators would...
    Looking again at the forum description, I guess this is the right place for this thread.

  14. #14
    sanityflare
    Guest
    This may be of use to you guys. I don't remember where I got this, as it's been on my HD for years, but it may help.

    Code:
    .BIG file specification addendum
    By B1FF (lmoloney@relic.com)
    
    The article listed on RelicNews is pretty complete WRT the .BIG file format.  It was really neat to download the program for viewing and extracting the contents of a bigfile.  We will probably release our bigfile creation program but you will note that our version of the ‘extract’ command was never finished.  Oh the pains of finalling!  
    
    The only thing that was not picked up on was the CRCs of the bigfile.  The CRC is an 8-byte CRC actually made up of 2 standard 32-bit CRCs.  Included is some sample code to create these CRCs.  I think I originally copied this code from Graphics Gems several games ago.  It’s pretty standard.  Make note of this algorithm.  It is also used in the .CRC format.
    
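    #include <stdint.h>
    
    /* Type definitions assumed here; they were not part of the original listing. */
    typedef uint8_t  ubyte;
    typedef uint32_t udword;
    typedef uint32_t crc32;
    
    udword CRCTable[] =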
    {
         0x00000000,0x77073096,0xEE0E612C,0x990951BA,
         0x076DC419,0x706AF48F,0xE963A535,0x9E6495A3,
         0x0EDB8832,0x79DCB8A4,0xE0D5E91E,0x97D2D988,
         0x09B64C2B,0x7EB17CBD,0xE7B82D07,0x90BF1D91,
         0x1DB71064,0x6AB020F2,0xF3B97148,0x84BE41DE,
         0x1ADAD47D,0x6DDDE4EB,0xF4D4B551,0x83D385C7,
         0x136C9856,0x646BA8C0,0xFD62F97A,0x8A65C9EC,
         0x14015C4F,0x63066CD9,0xFA0F3D63,0x8D080DF5,
         0x3B6E20C8,0x4C69105E,0xD56041E4,0xA2677172,
         0x3C03E4D1,0x4B04D447,0xD20D85FD,0xA50AB56B,
         0x35B5A8FA,0x42B2986C,0xDBBBC9D6,0xACBCF940,
         0x32D86CE3,0x45DF5C75,0xDCD60DCF,0xABD13D59,
         0x26D930AC,0x51DE003A,0xC8D75180,0xBFD06116,
         0x21B4F4B5,0x56B3C423,0xCFBA9599,0xB8BDA50F,
         0x2802B89E,0x5F058808,0xC60CD9B2,0xB10BE924,
         0x2F6F7C87,0x58684C11,0xC1611DAB,0xB6662D3D,
    
         0x76DC4190,0x01DB7106,0x98D220BC,0xEFD5102A,
         0x71B18589,0x06B6B51F,0x9FBFE4A5,0xE8B8D433,
         0x7807C9A2,0x0F00F934,0x9609A88E,0xE10E9818,
         0x7F6A0DBB,0x086D3D2D,0x91646C97,0xE6635C01,
         0x6B6B51F4,0x1C6C6162,0x856530D8,0xF262004E,
         0x6C0695ED,0x1B01A57B,0x8208F4C1,0xF50FC457,
         0x65B0D9C6,0x12B7E950,0x8BBEB8EA,0xFCB9887C,
         0x62DD1DDF,0x15DA2D49,0x8CD37CF3,0xFBD44C65,
         0x4DB26158,0x3AB551CE,0xA3BC0074,0xD4BB30E2,
         0x4ADFA541,0x3DD895D7,0xA4D1C46D,0xD3D6F4FB,
         0x4369E96A,0x346ED9FC,0xAD678846,0xDA60B8D0,
         0x44042D73,0x33031DE5,0xAA0A4C5F,0xDD0D7CC9,
         0x5005713C,0x270241AA,0xBE0B1010,0xC90C2086,
         0x5768B525,0x206F85B3,0xB966D409,0xCE61E49F,
         0x5EDEF90E,0x29D9C998,0xB0D09822,0xC7D7A8B4,
         0x59B33D17,0x2EB40D81,0xB7BD5C3B,0xC0BA6CAD,
    
         0xEDB88320,0x9ABFB3B6,0x03B6E20C,0x74B1D29A,
         0xEAD54739,0x9DD277AF,0x04DB2615,0x73DC1683,
         0xE3630B12,0x94643B84,0x0D6D6A3E,0x7A6A5AA8,
         0xE40ECF0B,0x9309FF9D,0x0A00AE27,0x7D079EB1,
         0xF00F9344,0x8708A3D2,0x1E01F268,0x6906C2FE,
         0xF762575D,0x806567CB,0x196C3671,0x6E6B06E7,
         0xFED41B76,0x89D32BE0,0x10DA7A5A,0x67DD4ACC,
         0xF9B9DF6F,0x8EBEEFF9,0x17B7BE43,0x60B08ED5,
         0xD6D6A3E8,0xA1D1937E,0x38D8C2C4,0x4FDFF252,
         0xD1BB67F1,0xA6BC5767,0x3FB506DD,0x48B2364B,
         0xD80D2BDA,0xAF0A1B4C,0x36034AF6,0x41047A60,
         0xDF60EFC3,0xA867DF55,0x316E8EEF,0x4669BE79,
         0xCB61B38C,0xBC66831A,0x256FD2A0,0x5268E236,
         0xCC0C7795,0xBB0B4703,0x220216B9,0x5505262F,
         0xC5BA3BBE,0xB2BD0B28,0x2BB45A92,0x5CB36A04,
         0xC2D7FFA7,0xB5D0CF31,0x2CD99E8B,0x5BDEAE1D,
    
         0x9B64C2B0,0xEC63F226,0x756AA39C,0x026D930A,
         0x9C0906A9,0xEB0E363F,0x72076785,0x05005713,
         0x95BF4A82,0xE2B87A14,0x7BB12BAE,0x0CB61B38,
         0x92D28E9B,0xE5D5BE0D,0x7CDCEFB7,0x0BDBDF21,
         0x86D3D2D4,0xF1D4E242,0x68DDB3F8,0x1FDA836E,
         0x81BE16CD,0xF6B9265B,0x6FB077E1,0x18B74777,
         0x88085AE6,0xFF0F6A70,0x66063BCA,0x11010B5C,
         0x8F659EFF,0xF862AE69,0x616BFFD3,0x166CCF45,
         0xA00AE278,0xD70DD2EE,0x4E048354,0x3903B3C2,
         0xA7672661,0xD06016F7,0x4969474D,0x3E6E77DB,
         0xAED16A4A,0xD9D65ADC,0x40DF0B66,0x37D83BF0,
         0xA9BCAE53,0xDEBB9EC5,0x47B2CF7F,0x30B5FFE9,
         0xBDBDF21C,0xCABAC28A,0x53B39330,0x24B4A3A6,
         0xBAD03605,0xCDD70693,0x54DE5729,0x23D967BF,
         0xB3667A2E,0xC4614AB8,0x5D681B02,0x2A6F2B94,
         0xB40BBE37,0xC30C8EA1,0x5A05DF1B,0x2D02EF8D,
    };
    
    /*=============================================================================
        Functions:
    =============================================================================*/
    /*-----------------------------------------------------------------------------
        Name        : crc32Compute
        Description : Compute a 32-bit CRC
        Inputs      : packet - pointer to the bytes to checksum
                      length - number of bytes in packet
        Outputs     : none
        Return      : the 32-bit CRC of the buffer
    ----------------------------------------------------------------------------*/
    crc32 crc32Compute(ubyte *packet, udword length)
    {
       udword index, tableIndex;
       crc32  crc;
    
       crc = 0xffffffff;
       for (index = 0; index < length; index++)
       {
          tableIndex = (crc ^ *(packet++)) & 0x000000FF;
          crc = ((crc >> 8) & 0x00FFFFFF) ^ CRCTable[tableIndex];
       }
       return(~crc);
    }
    
    The first CRC is of the first half of the file name and the second CRC is of the second half of the file name.  Why do such a silly scheme?  It makes it easy to sort the TOC by CRC and do a binary search for a filename.  This makes for faster lookups.  All file requests in our file layer are resolved from the text name to an 8-byte CRC.
    
    As for some unknown data members, the header_unknown member you refer to is always 1.  A bit redundant?  Yes.  The toc_unknown[1..3] can be ignored.  They’re padding that is cleared to something by the compiler.
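
For anyone who wants to check the table and routine in the addendum without typing in all 256 entries, here is a Python sketch that regenerates the table from the reflected polynomial 0xEDB88320 and mirrors `crc32Compute`. The `name_crc_pair` helper is a guess at the 8-byte name CRC: how an odd-length name is split, and whether the name is case-folded first, are assumptions.

```python
def build_crc_table():
    """Regenerate the 256-entry CRC table from polynomial 0xEDB88320."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

CRC_TABLE = build_crc_table()

def crc32_compute(packet):
    """Python equivalent of the crc32Compute routine in the addendum."""
    crc = 0xFFFFFFFF
    for byte in packet:
        crc = (crc >> 8) ^ CRC_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

def name_crc_pair(name):
    """CRC the two halves of a file name separately (split point is assumed)."""
    half = len(name) // 2
    return crc32_compute(name[:half]), crc32_compute(name[half:])
```

The generated entries at indexes 1, 128, and 255 are 0x77073096, 0xEDB88320, and 0x2D02EF8D, matching the listing above.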

  15. #15
    ThomasT
    Guest

    BIG file CRC code in Bigaddendum.zip

    Thanks, SnakeChiken, for the information. I have found where this information is from. It is in BIGaddendum.zip.
