
News URLs tell their own stories.

Questions about validity, reliability, and authority of news items continue to plague information professionals, particularly as the news cycle becomes ever faster. I've discovered that it's increasingly feasible to assess the relative value of an online news item from characteristics of its URL. Many URLs display value hints; this trend is growing. Certain hints, especially by their absence, suggest the URL (or even its source) has lower value for professional audiences.

URLs often provide first impressions of a news item or website. URL elements (such as formats, dates, sequences, or appended arguments) may produce expectations about the source. This trend in URL messaging matters to those who seek online news and those who produce it.

Since 2001, Nexcerpt has scanned billions of URLs, and analyzed hundreds of millions, across thousands of news, media, corporate, and government domains. The system was designed to derive rules for evaluating the relative value of news items based primarily on the sources' URLs--rules surprisingly applicable to novel sources.

This took considerable effort to discover. Let me save you some time.


URLs communicate more than you realize. In fact, they can talk to you about quality.

Here's one very recognizable URL pattern: /2001/09/11/

While recognizable, that particular URL pattern was not commonly used in 2001. The rise of blogging platforms, which often embed "/YYYY/MM/DD/" in URLs, may be what made that mercifully logical (and sortable) format so popular today. (It now appears in 15% of URLs scanned by Nexcerpt.)

These preferred URL patterns arise naturally, by silent agreement. Early, methodical naming schemes were inevitable. Resource managers and programmers reused good schemes; others copied success. URLs containing "/1995/" led to content from that year. If the URL included "?id=555" then "?id=556" would appear next. All this seemed obvious.

Less obvious, though, was that the rigor of URL patterns, and the thinking it reflected, became increasingly correlated with the value (including authority and quality) of associated content. Solid content sources hire solid information professionals who produce solid schemes--quality attracts quality. But even in the best systems, implementation details offer insight into the character and priorities of the sponsor.

Especially over time, URL patterns reflect the standards of an organization, its offline resources, and its business practices. Many URLs provide hints of goals or missions; some reveal disarray in a technical realm or in an entire organization.


Digit-free URLs suggest lower value content. Some URL assignment schemes focus on search optimization, using strings of keywords containing few (or no) dates or numbers. Nexcerpt has noted apparent correlations between such digit-free URLs and content of lower value, at least to our professional clients.

Nexcerpt's client list is business-oriented, though it also contains some nonprofits. Most focus on strategic matters in law, research, technology, marketing, or public policy. Their Nexcerpt accounts provide up to 10 Boolean queries of any length. (Our longest query is more than 1,300 characters; the most complex contains more than 130 terms.) Given that degree of control, excerpts that match queries, and reach accounts, are likely to be of high value.

When the URL contains no digits, the associated content is less likely to match client queries. That is, an absence of URL digits is a predictor of low value content. Why should this be for "keyword optimized" URLs?

We considered whether sources using digit-free URLs (rather than URLs themselves) might be culpable. Some are magazines familiar from your local grocery store checkout counter; others promote a (mostly conservative) political agenda; many report only local news. Yes, faddish, political, or hyperlocal sources offer fewer items of value to strategic planners. We scan some anyway, to broaden reporting on social trends--one of which is that they date or number less content!

However, URLs containing digits from those sources (when they provide such URLs) are excerpted at higher rates. URLs containing digits tend to point to articles of higher value (by client keyword relevance), even within a single domain.
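For readers who want to try the digit test on their own link lists, here is a minimal Python sketch. It is an illustration only, not Nexcerpt's actual rule; the function name and the example.com URLs are invented for the purpose. It looks for digits in the path and query, ignoring the host so that a numeric domain name does not count.

```python
from urllib.parse import urlsplit

ASCII_DIGITS = set("0123456789")

def path_has_digits(url: str) -> bool:
    """Return True if the path or query string contains any ASCII digit."""
    parts = urlsplit(url)
    # The host is deliberately left out: only the item's own address counts.
    return any(ch in ASCII_DIGITS for ch in parts.path + parts.query)

# Invented examples: a dated URL and a keyword-only URL.
print(path_has_digits("http://example.com/news/2013/03/01/budget-vote"))
print(path_has_digits("http://example.com/best-celebrity-diets"))
```

A digit-free result proves nothing about any single article; at most it is one weak signal to weigh alongside others.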

We're left wondering why. Do systems pursue a (dubious) theory that digits in URLs harm SEO? Do writers on political topics prefer not to date their work? Or are fluff pieces given fluff URLs?


Digits and associated arguments convey meaning. Common patterns notwithstanding, online news URLs remain as diverse as the domains they inhabit. Many government entities, corporate groups, publishing families, broadcast venues, and news outlets tweak their content-naming scheme to match the organization's style--or may reveal it unintentionally.

Here are a few numeric identifiers (but no pure date stamps), excluding English elements from each URL. These identifiers don't represent the same content--they merely demonstrate diversity among numeric schemes, how a source may seek to be (or accept being) perceived, or coherence between technical and editorial missions.

?id=18617786 (ABC News)

-62220678.htm (CNET Asia)

/2125248/ (Ask Slashdot)

/16270325/ (Yahoo! News AU)

Large values suggest high-volume content around the clock. Now consider these numeric elements of URLs:

/488084 (Al Hayat)

/440702/ (International Business Times AU)

?story=44127 (Alibi)

?articleid=1656537 (Archives of Internal Medicine)

?nxd_id=641582 (Arkansas Matters)

Al Hayat and IBT are numerically matter-of-fact. Alibi views each item as a "story," while JAMA calls each an "article." By comparison, "nxd_id" at Arkansas Matters seems arbitrarily dry. The next group of URLs incorporates forms of a date:

/201303011053.html (All Africa)

/201331145711873289.html (Al-Jazeera)

/AJ201303010078 (Asahi Shimbun)

All Africa appends a sequence to the publication date; Al-Jazeera adds a timestamp, unhelpfully dropping zeroes from month and day. Asahi Shimbun is more readable, but a (superfluous) "AJ" tags items from "ajw" (Asia Japan Watch). As though to offset that redundancy, the ".html" is removed. And then there's Arizona:

/article_a58a9f67-1a06-5ef6-8944-f6b24ca5677f (AZ Starnet)

This system suggests that content management was bid out. Perhaps AZ Starnet prefers to focus on reporting--and doesn't expect humans to interact with URLs.
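The identifiers listed above can be taken apart mechanically. The Python sketch below is my own illustration, with invented function names: it splits a token into an optional letter tag, an eight-digit date, and a trailing sequence, and separately checks whether a token parses as a UUID. It assumes the date is always YYYYMMDD with zero padding, which is exactly why the Al-Jazeera identifier fails to parse.

```python
import re
import uuid
from datetime import date

# Optional letters (such as Asahi Shimbun's "AJ"), eight date digits,
# then one or more sequence digits.
DATE_SEQ = re.compile(r"^(?P<tag>[A-Za-z]*)(?P<ymd>\d{8})(?P<seq>\d+)$")

def split_date_sequence(token: str):
    """Return (tag, date, sequence) or None if the token does not fit."""
    m = DATE_SEQ.match(token)
    if not m:
        return None
    ymd = m.group("ymd")
    try:
        d = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:8]))
    except ValueError:
        return None  # eight digits that are not a real calendar date
    return m.group("tag"), d, m.group("seq")

def uuid_version(token: str):
    """Return the UUID version number, or None if the token is not a UUID."""
    try:
        return uuid.UUID(token).version
    except ValueError:
        return None

print(split_date_sequence("201303011053"))        # All Africa
print(split_date_sequence("AJ201303010078"))      # Asahi Shimbun
print(split_date_sequence("201331145711873289"))  # Al-Jazeera: None
print(uuid_version("a58a9f67-1a06-5ef6-8944-f6b24ca5677f"))  # AZ Starnet
```

The AZ Starnet token reports version 5, which in the UUID specification denotes a name-based identifier. Whether the content management system really generates it that way is a guess on my part.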

My job [see the sidebar "After 200 Million URLs, I've Seen It All"] includes curating more than 6,000 such active sources, representing hundreds of distinct schemes. It's fascinating to consider what such diverse URL patterns reveal.


URLs may meet or confound expectations. URL schemes often contradict natural expectations. For example, while scanning for current awareness, it may seem reasonable to ignore URLs self-identifying as "archive"--but that's not wise.

Today, 5% of Nexcerpt sources use "archive" in new item URLs. In 2001, the usage represented less than half of one percent.

Walmart, Stanford Securities Class Action Clearinghouse, and Fuel Cell Today, among others, embed "/news-archive/" in new item URLs, along with the date of publication.

New URLs contain "/archive/" across such diverse sources as Sacramento Bee, Netcraft News, Reason, The Stranger, and Talking Points Memo, to name only a few. Again, these URLs also embed the publication date. But, when new items exist only in "archive," the common meaning appears lost. It's puzzling to add that patina to "news."

On the other hand, some naming schemes are unambiguous. Sports reporting produces massive volume. From 15% to 30% of news search results are game previews, results rundowns, or player interviews. With hundreds of venue sponsors and team mascots, any query focused on cities, brands, or animal species tends to return a clutter of (irrelevant) sports news.

We've developed several rules to reduce that noise. The simplest: Ignore URLs containing "/sport" (followed by optional "s" or punctuation) and then any of the following: calendar|headlines|hq|info|news|podcast|roundup|scores.

That single rule helps us avoid more than 8,000 "news" items daily. Since important, business-related sports items echo elsewhere under "/sport"-free URLs, we reduce scanning with no loss of meaningful coverage.
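The sports rule described above translates naturally into a regular expression. The Python sketch below is one possible reading, not the production rule: it assumes "optional 's' or punctuation" means an optional "s" followed by at most one non-alphanumeric character, and it matches case-insensitively. The sample paths are invented.

```python
import re

# "/sport", optional "s", at most one punctuation character, then a keyword.
SPORTS_NOISE = re.compile(
    r"/sports?[\W_]?"
    r"(?:calendar|headlines|hq|info|news|podcast|roundup|scores)",
    re.IGNORECASE,
)

def is_sports_noise(url: str) -> bool:
    """Return True if the URL matches the sports-noise pattern."""
    return SPORTS_NOISE.search(url) is not None

for path in (
    "/sports/scores/",                    # matches
    "/sport-news/late-goal-wins",         # matches
    "/business/2013/03/01/stadium-deal",  # no "/sport": kept
    "/transport/news/",                   # "/transport" is not "/sport": kept
):
    print(path, is_sports_noise(path))
```

Requiring the leading slash is what keeps words such as "transport" from triggering the rule.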


On the noise-control front, 15% of our rules transform URLs to reduce duplicated content. It's astonishing how many "professional" news sites unintentionally offer multiple URL variations, assigning several nonparallel URLs to one item. We canonicalize URLs to correct such errors and avoid tasking servers with redundant requests.
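To make "canonicalize" concrete, here is a small Python sketch using only the standard library. It is a generic illustration, not a description of Nexcerpt's transforms. The list of ignorable query parameters is an assumption, as is the idea that a trailing slash and a fragment never change which article is served; real sources need per-site rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed (hypothetical) parameters that do not select the article.
NOISE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}

def canonicalize(url: str) -> str:
    """Reduce common URL variations of one item to a single form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Drop noise parameters and sort the rest so order does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in NOISE_PARAMS
    )
    # The fragment (after "#") is discarded entirely.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))

a = canonicalize("HTTP://Example.com/story/?id=556&utm_source=feed#comments")
b = canonicalize("http://example.com/story?utm_medium=rss&id=556")
print(a)       # http://example.com/story?id=556
print(a == b)  # True
```

Two variants that reduce to the same string need only one request to the server.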

"The Ugly URL" is another article, but if your system produces duplicate, missing, or absurd URLs, people notice!

On that note, I'll congratulate The Horse for a recent upgrade. It offers valuable reports, for example, about healthcare ( ter-health-concerns).

Until recently though (Feb. 27, 2013, to be precise) that URL--most of their URLs--carried a cookie: 4iMY1dXIMhUV5J_MSLubqyazazQMSwjBiUyLoE 47eaKVPKAwapOxN6jU6uQL2LG_xCNh2Ou4lw96 V2hMdc9FOmWlqcEU5JNorj7QnJdSwrKSPdtzIoB MiiH8fzOT3CJE_TBwrE2DT6_ksXFSyOckGx9Ky3 m-_SAGox8bKL131GsmlgQ2))/free-reports/ 30922/winter-health-concerns

Please try not to do that. Trust me. It's for your own good.


Perceptive people are already assessing your website and content, at least in part, by reading your URLs. Some become very adept at assessment. We all do this to some extent, though often unconsciously.

Consider again that recognizable date (digits in "/NNNN/NN/NN/" format), how we interpret it, and what it has replaced.

If we observe a sensible year (especially 1990 to 2013), month (01 to 12), and day (01 to 31), we presume that's the publication date, particularly if the first four digits unambiguously match a recent year--"/2013/" is more convincing than "/13/". Some sources "reprint" very old editions, back even beyond "/1900/". However, most "dated" URLs are from the last decade or two.

We also are more likely to assume a date if the month is padded with a zero. That is, "/2013/01/" is clearer than "/2013/1/" (online, in fact, the latter more likely means the first quarter or volume of 2013). Zero padding also makes the day more recognizable (and URLs sort chronologically).
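The intuitions in the last two paragraphs (a plausible four-digit year, a zero-padded month and day) can be written down as a short test. This Python sketch is mine, with an invented function name; the 1990 to 2013 window simply mirrors the range mentioned above and is adjustable.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, each bounded by slashes.
DATE_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def publication_date(url: str, earliest: int = 1990, latest: int = 2013):
    """Return the date embedded in the URL path, or None if none is plausible."""
    m = DATE_PATH.search(url)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    if not earliest <= year <= latest:
        return None
    try:
        return date(year, month, day)  # rejects month 13, February 31, etc.
    except ValueError:
        return None

print(publication_date("/2001/09/11/"))           # 2001-09-11
print(publication_date("/news/2013/1/15/story"))  # None: month is not padded
print(publication_date("/news/2013/02/31/x"))     # None: no such day
```

Only the first match is examined, so a path that carries two date-like runs would need a more careful rule.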

As noted earlier, blogging popularized "directory" slashes above other delimiters. Although dashes (YYYY-MM-DD) were relatively common 10 years ago, perhaps reflecting the ISO 8601 standard (first published in 1988, with a third edition in 2004, and summarized humorously in an XKCD cartoon), few online media still use them. Underscores were, and remain, more rare.

In the early 2000s, more media (especially in EU) used "European" dates (where "/09/11/" means "November 09"), but that has waned. Some platforms use no delimiters (e.g., "20130415"), a practice that is also fading.


Personally, I like UNIX (aka POSIX) timestamps. They're precise and unambiguous--being the tally of seconds since 1970/01/01/00:00 UTC--but they're awkward for humans, as they now contain 10 digits (~1360000000).
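For anyone who has not worked with these timestamps, the conversion is a one-liner in most languages. The Python sketch below converts the approximate value quoted above and also looks for a ten-digit run in a URL. The helper names and the example.com URL are invented, and a ten-digit run is not proof of a timestamp; it could just as easily be an article number.

```python
import re
from datetime import datetime, timezone

# A run of exactly ten digits, not embedded in a longer number.
TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")

def from_unix(seconds: int) -> datetime:
    """Convert seconds since 1970-01-01 00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def find_unix_timestamp(url: str):
    """Return the first ten-digit run in the URL as a datetime, else None."""
    m = TEN_DIGITS.search(url)
    return from_unix(int(m.group())) if m else None

print(from_unix(1360000000).isoformat())  # 2013-02-04T17:46:40+00:00
print(find_unix_timestamp("http://example.com/story/1360000000.html"))
```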

When I began building rules in 2001, 10% of Nexcerpt sources (then among only 2,000) employed UNIX timestamps in their URLs. Today, among 6,000 sources, use has fallen to below half of 1%. One UNIX devotee is American Lawyer, which may value the precision.

UNIX also flickers among ABC broadcast affiliates, and Fairfax Media members such as Canberra Times and Sydney Morning Herald, as legacy elements.

Julian dates in URLs appear to be a thing of the past. Among our sources, major newspapers in Boston and Manila finally abandoned them, nearly simultaneously, several years ago. I think I speak for all sane people when I say, "Thanks for no longer rendering the things which were Caesar's."

Myriad other date formats are also increasingly rare in URLs--and rightfully so. (I find it stunning that defaults other than "/YYYY/MM/DD/" still appear in some blog platforms.) The bottom line is that some formats win, while some lose--and people recognize a winner.


Now we're all news producers. Eventually, you'll likely be involved in creating some new website or online repository. Please do not neglect or reinvent common practice. Your novel URL structure may be cute (not helping your audience), clever (actually confusing them), or fascinating (perhaps to Spock, or sentient androids, but not to humans). Your scheme may support log analysis (already being done) or SEO (ditto). You may even sense that your URLs are "obscured" (I used to work at the National Security Agency, so I'm laughing right now) from search engines that wish to understand your naming conventions.

Your giddiness over such novelty carries a price. Increasingly, if your URLs don't reflect a recognizable structure, perceptions of your content are tainted. I may be one of only a few people studying URLs at this volume and detail, but I am not the only such person. We're earnestly seeking better ways to assess the quality of online content. Our focus is on selection rules and ranking algorithms. If your URLs ignore common practice, or appear random, how do you think we'll score that?


To close, here are some bizarre schemes we've seen in production. Sources persisting in such behaviors shall remain nameless.

During 2010, one international source numbered URLs in reverse sequential order, like a countdown. (That one I'll name: Asahi Shimbun, which had the good sense to stop in 2011, several months before zero.)

One major technology source uses URLs that encode a sequential article number with the (increasing) number of days since publication. Thus, the URL for an article changes completely every day--made more dramatic by its converting the string to base 36! (It's easy, but not obvious, to derive the unique article number by reducing to base 10 and applying a simple modulo test.)
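The base 36 step, at least, is easy to demonstrate. The Python sketch below shows only the conversion between base 36 and base 10; it does not reproduce that source's encoding or the modulo test, neither of which is spelled out here. The function names and sample tokens are invented.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36_to_int(token: str) -> int:
    """Read a base 36 string (digits 0-9, then a-z) as an integer."""
    return int(token, 36)

def int_to_base36(n: int) -> str:
    """Write a non-negative integer in base 36."""
    if n < 0:
        raise ValueError("negative values are not supported")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))

print(base36_to_int("zz"))  # 1295
print(int_to_base36(1295))  # zz
```

Because base 36 packs about 5.2 bits into each character, even a small change in the underlying number can alter most of the visible string, which is why such URLs appear to change completely.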

A significant number of medium-market newspapers use a scheme that constantly changes the date in every URL. Each URL contains a sequential (22-character hexadecimal) document ID. That ID persists, but the date in the URL matches whatever date you retrieve the article. In other words, you perceive "today" as "publication date," no matter when you look.

Those make no sense to me--and I find sense in such things for a living.

To review your own URLs, consider at least the following points:

1. Your URL structure may be the first impression a reader has of your content.

2. As a rule, URLs with recognizable dates or digits point to valuable content.

3. Any website "design" should include schemes for efficient and readable URLs.

4. As a rule, noisy "random-looking" strings in URLs ignore points 1 through 3.

Readers of Online Searcher likely have a professional obligation to understand URLs. All consumers and producers of news will be wise to consider what URLs are communicating--whether we realize it or not.

Universal Uniformity University

Reading URLs is the new literacy. It's an essential job skill, which we have little excuse for lacking: URLs haven't changed much in 20 years.

As early as 1991, Tim Berners-Lee at CERN described the URL system we still use. For a sense of how stable it has been, see his 1994 "Universal Resource Identifiers." I find it hilarious that in the earliest descriptions, Berners-Lee uniformly used the word "Universal," while "Uniform" is now used universally!

To understand URLs fully, it's helpful to compare URIs--not "Locators" but "Identifiers." Incomprehensibly, the W3 link to 2005's "Uniform Resource Identifier (URI): Generic Syntax" is broken! A copy is available from IETF.

After 200 Million URLs, I've Seen It All

Not long after Nexcerpt's launch, Barbara Quint reviewed our service in the April 2003 issue of Searcher. Her "killer product" quote is still linked from our site. We've provided custom excerpts to private clients every day since. For examples of our Exfacto! brand of automated public feeds, see Anti-Phishing Working Group or Crowdlanding.

Over the years, we've grown from observing about 20,000 new articles per day to the current rate of some 80,000 new articles per day (from more than 6,000 online sources). In February 2013, Nexcerpt monitored its 200 millionth article!

Nexcerpt invokes source-specific rules (as regular expressions) to assess the likelihood that URLs provide valuable content. Daily reports detail the text volume, keyword tally, and other performance data seen for each URL. Where Nexcerpt's rules are solid, article data reflect it. Otherwise, article data reveal a rule shortcoming, which we address.

These rules have become very effective in assessing article value--based primarily upon the internal structure of URLs. Across our 6,000 hand-curated sources, our processes observe some 4 million URLs per day. Based upon URL structure alone (and the accrued knowledge within our rules) we assuredly discard 98% of those URLs; more than half as not current, the rest as of limited value.

It's likely that you already make similar judgments. If you don't, you may want to start reading URLs more closely!

After leading design and development of Nexcerpt, my focus turned to quality control for these URL assessment rules. Some rules are created or validated by automated processes. However, it's hard to beat the human eye for noticing new patterns, commonalities, and anomalies across such a mass of data.

This has led me to scan all 200 million URLs by eye. (Part of me recoils at that admission--at least my mouse wrist!) Yes, over 10 years, I've personally viewed all those URLs and their associated performance data. I'm so accustomed to it that a visual scan of 40,000 URLs each morning, and another each evening, typically consumes less than an hour of my day, though I often spend longer tweaking rules accordingly. And, yes, I still enjoy it!

That's how I accrued uncommon expertise on the ways media sites form (and malform) URLs. Nexcerpt captures the full range of behaviors (desirable and otherwise) across domains, platforms, owners, and time. My task is to notice URL behaviors and structures and to tease out rules to assess their value.

Gary Stock is CEO of Nexcerpt, a customized news clipping and briefing service.

Comments? Email the editor-in-chief.
COPYRIGHT 2013 Information Today, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2013 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation: uniform resource locator
Author: Stock, Gary
Publication: Online Searcher
Geographic Code: 1USA
Date: May 1, 2013