I've been working heavily in the world of HTML and browsers over the past six months, and for the most part I enjoy it. It's much, much better than the scene six years ago, when I was at KnowNow writing fairly advanced JavaScript client code that was supposed to work on both Netscape Navigator and Internet Explorer - what a nightmare. Today's browser world is far better. Now people only complain about box models being a pixel or two off. Well, and there's that z-index bug Microsoft hasn't fixed and probably doesn't even know about, even though everyone else does...
Anyway, my most recent learning experience has been with information extraction from Web pages - essentially extracting meaningful keywords from HTML. I must say, there's a lot to learn here. I've found several research papers that are really educational, especially those that talk about extraction in the absence of a large body of other documents (a corpus) against which to measure relevance.
As I was going through some experiments I realized that doing a decent job of extracting text from HTML requires knowing what 'markup' is and what the particular elements of HTML are defined to mean. Extracting meaningful phrases from markup means ignoring the markup and getting to the underlying text that was marked up. But then I began to notice something - in all the advanced HTML pages that use the latest CSS to accomplish 'semantic HTML' (a phrase that I've heard tossed around pretty loosely) something is going wrong. The underlying text that is marked up is becoming gibberish. This is due to using CSS for layout while ignoring the effect of the tags on the text.
For example, a span tag is an 'inline' element - the text it wraps is not meant to be fragmented and split apart, so any extraction tool (especially the naive one I was experimenting with) should merge the text fragments before, within and after the span element with no whitespace added. But designers will often add layout and margins to the span tag in order to visually separate the text - yet the underlying markup still says the text fragments are contiguous, so the extracted words run together. How annoying. There is a simple solution - tag the text as it is intended to be read, and understand the difference between 'inline' and 'block' semantics for narrative text.
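To make the inline versus block distinction concrete, here is a minimal sketch of a naive extractor in Python. It is not the tool I was experimenting with, just an illustration using the standard library's `html.parser`. The names `BLOCK_TAGS`, `TextExtractor` and `extract_text` are made up for this example, and the list of block tags is a small assumed subset rather than the full set HTML defines.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of block-level tags, not the full HTML list.
BLOCK_TAGS = {
    "p", "div", "li", "ul", "ol", "blockquote", "table", "tr", "td",
    "h1", "h2", "h3", "h4", "h5", "h6", "br",
}


class TextExtractor(HTMLParser):
    """Collects text, breaking only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _boundary(self, tag):
        # Block tags separate text; inline tags (span, em, a, ...) add nothing,
        # so the fragments around them are joined with no whitespace.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return a list of text blocks, one per run of block-separated text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Normalize whitespace inside each block and drop empty blocks.
    blocks = [" ".join(line.split()) for line in raw.split("\n")]
    return [block for block in blocks if block]


# Inline markup used as intended: the fragments form one word.
print(extract_text('<p>key<span class="hl">word</span>s</p><p>next one</p>'))

# Spans separated only by CSS margins: the markup says the text is contiguous.
print(extract_text('<div><span style="margin-right:1em">Home</span><span>About</span></div>'))
```

The first call yields `['keywords', 'next one']`, which is what the markup intends. The second yields `['HomeAbout']`: the page looks fine in a browser because of the margin, but the extractor, following the inline semantics, glues the two words together.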