For a few years now, I've kept all my IM archives. I didn't really have a purpose, I just thought that it might be fun to one day look back and see what kind of discussions I had. Well, now I have 150 MB of logs from Digsby, Trillian and Pidgin and there is no way I'm ever going to read all of that again. But in light of a few things I learned recently (the Coursera NLP and ML courses) I am going to try to visualize and analyze my archives in a mathematical way. That's right, I'm reducing you to numbers :D At least what we've discussed :P
The first step is to extract the relevant information from the various serializations employed by Pidgin, Trillian and Digsby and put it in a centralized place. The final serialization format will be YAML, because it is human-readable and maps straightforwardly to Python's (and other languages') native types.
Let's analyze the structure of a typical Trillian log. The main folder contains separate folders for each protocol you use. Each protocol folder contains one folder called "Query" and, if you use group chats, a folder called "Channel". I'm going to ignore the latter, because I only have two group chat archives. The Query folder contains a folder called assets, which holds the files you sent or received, and three files for each contact:
contact_name.xml. The first one contains... some numbers. I don't know what they mean. The second one contains all the conversations, but without timestamps. The last one is what we need: all the conversations in XML format, including the timestamps.
Let's see what the XML file looks like.
The beginning of a conversation is marked by a session tag:
<session type="start" time="1310910034" ms="168" medium="YAHOO" to="thebestcatalin" from="rolisz"/>
A message is appropriately marked in a message tag:
<message type="outgoing_privateMessage" time="1310910050" ms="903" medium="YAHOO" to="thebestcatalin" from="rolisz" from_display="rolisz" text="whaaaaa%27%3F"/>
<message type="incoming_privateMessage" time="1310910100" ms="571" medium="YAHOO" to="rolisz" from="thebestcatalin" from_display="thebestcatalin" text="%3D%20%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29%29"/>
We also have tags for the end of a conversation (I presume this means the closing of a window) and for file transfers. We don't care about those.
From the message tag we will need the time, from and text attributes. The proper way to do this would be to parse the XML. But that's too complicated for such a simple task. We know that all the messages will be in this format, so we can pick them apart with a regular expression.
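import datetime
import re
import urllib.parse
import yaml

# Assumed pattern: the capturing groups are the message direction, time, from and text.
message_re = r'<message type="(.+?)_privateMessage" time="(\d+)" ms=".+?" medium=".+?" to=".+?" from="(.+?)" from_display=".+?" text="(.+?)"/>'

f = open(file, encoding="utf-8")
# Raw string, so the backslashes in the Windows path reach the regex engine escaped
name = re.search(r"Trillian\\logs\\(.+?)\\Query\\(.+?)\.xml", file)
dest = open("logs\\" + name.group(2) + ".txt", "a", encoding="utf-8")
for line in f.readlines():
    match = re.match(message_re, line)
    if match:
        # One YAML document per message: [timestamp, sender, text]
        dest.write(yaml.dump([datetime.datetime.fromtimestamp(int(match.group(2))).isoformat(sep=" "),
                              urllib.parse.unquote(match.group(3)),
                              urllib.parse.unquote(match.group(4))],
                             default_flow_style=False, explicit_start=True, allow_unicode=True))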
We open both the source file and the destination file explicitly as UTF-8. We get the name of the contact by matching a regular expression against the path of the file, where the second capturing group is going to contain the Yahoo/GTalk/Facebook id of the contact.
We then loop over all the lines in the file and apply that pretty little regex to every line. If the line matches, we use the second, third and fourth capturing groups to get the timestamp, the sender id and the actual text of the message. The timestamp is a Unix timestamp in seconds, which we convert to a date and then to a string. The name and the text of the message are URL-encoded, so we have to unquote them using the urllib library. And all is set to be written to the file in YAML format. Again, we have to explicitly tell the YAML serializer to allow Unicode characters.
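For comparison, here is roughly what the "proper" XML route could look like for a single line. This is only a sketch: it assumes every tag sits alone on its line and is self-closing, like the samples above. The function name parse_trillian_line is made up for illustration, and it converts the timestamp to UTC (instead of local time, as the regex version does) so that the result is the same on every machine.

```python
import datetime
import urllib.parse
import xml.etree.ElementTree as ET


def parse_trillian_line(line):
    """Return (timestamp, sender, text) for a <message/> line, else None."""
    try:
        element = ET.fromstring(line.strip())
    except ET.ParseError:
        return None  # not a well-formed, self-contained tag
    if element.tag != "message":
        return None  # session start/stop, file transfers, ...
    # Assumption: UTC is used here only to keep the example reproducible.
    timestamp = datetime.datetime.fromtimestamp(
        int(element.get("time")), tz=datetime.timezone.utc
    )
    # The attribute values are URL-encoded, so unquote them.
    sender = urllib.parse.unquote(element.get("from"))
    text = urllib.parse.unquote(element.get("text"))
    return timestamp, sender, text


sample = ('<message type="outgoing_privateMessage" time="1310910050" ms="903" '
          'medium="YAHOO" to="thebestcatalin" from="rolisz" '
          'from_display="rolisz" text="whaaaaa%27%3F"/>')
timestamp, sender, text = parse_trillian_line(sample)
print(timestamp.isoformat(sep=" "), sender, text)
```

The upside is that the order of the attributes no longer matters. The downside is that a single malformed line is silently skipped, so it is worth counting how many lines return None.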
Now let's take a quick look at the format of a Digsby file. This one is slightly neater, but it has more fluff around it.
The folder structure is digsby_user/protocol/protocol_id/contact_id/date.html. Each file is a standalone HTML file that can be opened with a browser. The fluff is made up of the tags necessary to make the document HTML compliant, a few styles and a few scripts.
The lines that are relevant for us look like this:
<div class="outgoing message" auto="False" timestamp="2009-05-03 12:40:35"><span class="buddy">rolisz</span> <span class="msgcontent"><span style="color: #000000; font: 10pt arial;">ai auzit ca o ajuns linux la 1%?</span></span></div>
Our code is very similar to the previous one, except the regex is a bit different:
<div class=".+? message" .+? timestamp="(.+?)"><span class="buddy">(.+?)</span> <span class="msgcontent">(<span style=".+?">)?(.+?)</span>
Of course, the matching groups have to be adjusted as well.
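To make the group adjustment concrete, here is a small sketch that runs the Digsby regex over the sample line from above. The function name parse_digsby_line is hypothetical. Note that the third group is the optional styling span, so the text ends up in the fourth group.

```python
import datetime
import re

DIGSBY_RE = re.compile(
    r'<div class=".+? message" .+? timestamp="(.+?)">'
    r'<span class="buddy">(.+?)</span> '
    r'<span class="msgcontent">(<span style=".+?">)?(.+?)</span>'
)


def parse_digsby_line(line):
    """Return (timestamp, sender, text) for a message div, else None."""
    match = DIGSBY_RE.search(line)
    if match is None:
        return None
    # Digsby already stores a readable timestamp, no epoch conversion needed.
    timestamp = datetime.datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    # Group 3 is the optional <span style="..."> wrapper, so skip it.
    return timestamp, match.group(2), match.group(4)


sample = ('<div class="outgoing message" auto="False" timestamp="2009-05-03 12:40:35">'
          '<span class="buddy">rolisz</span> <span class="msgcontent">'
          '<span style="color: #000000; font: 10pt arial;">'
          'ai auzit ca o ajuns linux la 1%?</span></span></div>')
digsby_message = parse_digsby_line(sample)
print(digsby_message)
```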
The last log file we will look at is the one made by Pidgin. Dear Pidgin. Pidgin's folder structure is logs/protocol/protocol_id/contact_id/datetime.html. Pidgin uses very unsemantic and outdated code. Inline styles, font tags, br's for newlines, b's for styling. Awful. But it's easy to parse:
<font color="#A82F2F"><font size="2">(02:12:57 PM)</font> <b>Catalin Balan:</b></font> <span style='font-family: Arial; font-size: small;'><span style='color: #000000;'>da.... pe dual screens face fițe</span></span><br/>
Again, it's the same as before, except we use a different regex:
<font color=".+?"><font size="2">\((\d\d:\d\d:\d\d [AP]M)\)</font> <b>(.+?)</b></font>(.+?)<br/>
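Here is a sketch of the Pidgin pattern applied to the sample line, with the literal parentheses and the digits written as regex escapes. The name parse_pidgin_line and the tag-stripping step are my own additions for illustration. The line only carries the time of day, so the date would still have to come from somewhere else (presumably the datetime in the file name).

```python
import datetime
import re

PIDGIN_RE = re.compile(
    r'<font color=".+?"><font size="2">\((\d\d:\d\d:\d\d [AP]M)\)</font> '
    r'<b>(.+?)</b></font>(.+?)<br/>'
)
TAG_RE = re.compile(r"<[^>]+>")


def parse_pidgin_line(line):
    """Return (time_of_day, sender, text) for a message line, else None."""
    match = PIDGIN_RE.search(line)
    if match is None:
        return None
    # 12-hour clock with AM/PM; %p assumes an English or C locale.
    time_of_day = datetime.datetime.strptime(match.group(1), "%I:%M:%S %p").time()
    sender = match.group(2).rstrip(":")  # the <b> tag includes a trailing colon
    text = TAG_RE.sub("", match.group(3)).strip()  # drop the styling spans
    return time_of_day, sender, text


sample = ('<font color="#A82F2F"><font size="2">(02:12:57 PM)</font> '
          '<b>Catalin Balan:</b></font> '
          "<span style='font-family: Arial; font-size: small;'>"
          "<span style='color: #000000;'>"
          'da.... pe dual screens face fițe</span></span><br/>')
pidgin_message = parse_pidgin_line(sample)
print(pidgin_message)
```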
To wrap things up, we have to make a list of all the files that we have to parse and gather them up. We will use the os.walk function and add all files that match our criteria to a list, which we will then iterate over, extracting the messages.
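import os

fileList = []
for root, dirs, files in os.walk(folder):
    for file in files:
        f = os.path.join(root, file)
        fileName, fileExtension = os.path.splitext(f)
        # Keep the XML logs, skip Trillian's *-assets.xml files
        if fileExtension == '.xml' and fileName[-7:] != "-assets":
            fileList.append(f)
    # Remove .git in place so os.walk does not descend into it
    if '.git' in dirs:
        dirs.remove('.git')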
We simply construct the full path of each file, check its extension to make sure it's XML (for Trillian; HTML for Digsby and Pidgin) and check that it's not one of the asset files (again, Trillian specific). If all this is true, we add the file to our fileList and repeat. And we make sure that we don't go through our .git folder, by removing it from dirs so that os.walk never descends into it. Lots of weird stuff going on there. You don't wanna mess with it.
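Since the same walk is needed for all three clients, it could be wrapped in a small function. This is just a sketch: collect_logs and its parameters are names I made up, and the temporary folder with its file names exists only to show which files are kept.

```python
import os
import tempfile


def collect_logs(folder, extension=".xml", skip_suffix="-assets"):
    """Return the paths under folder that have the given extension."""
    found = []
    for root, dirs, files in os.walk(folder):
        # Pruning dirs in place stops os.walk from descending into .git.
        if ".git" in dirs:
            dirs.remove(".git")
        for name in files:
            base, ext = os.path.splitext(name)
            skipped = bool(skip_suffix) and base.endswith(skip_suffix)
            if ext == extension and not skipped:
                found.append(os.path.join(root, name))
    return found


# Throwaway directory tree with made-up file names, purely for demonstration.
with tempfile.TemporaryDirectory() as folder:
    for relative in ("YAHOO/Query/alice.xml", "YAHOO/Query/alice-assets.xml",
                     "YAHOO/Query/notes.txt", "MSN/Query/bob.xml",
                     ".git/objects/junk.xml"):
        path = os.path.join(folder, *relative.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    names = sorted(os.path.basename(p) for p in collect_logs(folder))
print(names)
```

For Digsby and Pidgin the call would then be something like collect_logs(folder, ".html", skip_suffix=""), since there are no asset files to skip.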
Next time, we will see a few basic ways to visualize this data we have collected. But I think that will have to wait until I come back from my holiday \:D/