Search engine algorithms.
What's the fuss all about?
Do we really need to know all this scientific stuff about search engines?' asked Mike Crehan. 'Yes!' he answered unequivocally and proceeded to explain the practical competitive edge you gain when you understand search algorithm functions.
'If you know what ranks one document higher than another, you can strategically optimize and better serve your clients. Also if your client asks, 'Why is my competitor always in the top 20 and I'm not? How do search engines work?' If you say 'I don't know-they just do'--how long do you think you're going to keep this account?'
Grehan illustrated his point by quoting Brian Pinkerton, who developed the first full text retrieval search engine back in 1994. 'Picture this,' he explained, "A customer walks into a huge travel outfitters store, with every type of item, for vacations anywhere in the world, looks at the guy who works there, and blurts out, 'Travel.' Now where's that sales clerk supposed to begin?"
Search engine users want to to achieve their goals with minimum cognitive load and maximum. They don't think carefully when they are entering queries; they use inaccurate three word, and haven't learned proper query formulation. This makes the search engines job more difficult.
Heuristics, abundance problems, & the evolution of algorithms
Grehan went on to explain the important role that heuristics play in ranking documents. "A fascinating combination of things come together to produce a rank. We need to understand as much as we possibly can, so at least when we're talking about what ranks one document higher than another, we have some indication about what is actually happening."
Grehan described the progression of search algorithms over time. In early search engines, text was extremely important. But then search researcher Jon Kleinberg discovered what he termed 'the abundance problem.' The abundance problem occurs when a search for a query returns millions of pages all containing the appropriate text. Say a search on the term 'digital cameras' will return millions of pages. How do you know which are the most important or authoritative pages? How does a search engine decide which one is going to be the listing that comes to the top? Search engine algorithms had to evolve in complexity to handle the problem of over- abundance.
Insights from Ask Jeeves
Ask Jeeves is the seventh ranked property on the web and the number 4 search engine, according to Rahul Lahiri from Ask Jeeves. Lahm described a number of components that are key to Ask Jeeves search algorithms, including index size, freshness of content and data structure. Ask Jeeves' focus on the structure of data is unique and differentiates its approach from other engines, he said. Them are two key drivers in web search: content analysis and linkage analysis. Lahiri confirmed that Ask Jeeves looks at the web as a graph and looks at the link relationships between them, attempting to map clusters of related information.
By breaking down the web into different communities of information, ask Jeeves can rely on the 'knowledge' from authorities in each community to better understand a query and present more on-topic results to the searcher. If you have a smaller site, but one that is very relevant within your community, your site may rank higher than some larger sites that provide relevant information but axe not part of the community.
Why co-occurrence is important-Dr Garcia
Dr.Garcia is a scientist with a special interest in Artificial Intelligence and Information Retrieval. He explained that terms that co-occur more frequently tend to be related or 'connected.' Furthermore, semantic associations affect the way we think of a term. When we see the term 'aloha' we think of 'Hawaii' because of the semantic associations between the terms. Co-occurrence theory, according to Garcia, can be used to understand semantic associations between terms, brands, products, services, etc.
Dr.Garcia then posed a question. Why should we care about term associations in a search engine'? His answer: Think about keyword-brand associations. This has powerful implications for search marketing. Where is the evolution of the search algorithm going? Grehan had a ready answer- He expects the introduction of probabilistic latent semantic indexing and probabilistic hyper text induced topic search.
Web Services Transactions and Heuristics
The current forerunners for the title of Web Services transactions standard are the WS-AtomicTransaction or WS-Transaction Management specifications. Both of these provide protocols intended for interoperability of existing transaction processing systems.
In an earlier article, we saw how the concepts of ACID transactions have played a cornerstone role in creating today's enterprise application environments. They provide guaranteed consistent outcome in complex multiparty business operations using a two-phase commit protocol. The current forerunners for the title of Web Services transactions standard are the WS- AtomicTransaction or WS-Transaction Management specifications. Both of these provide protocols intended for interoperability of existing transaction processing systems. However, as we'll see in this paper, only one of those specifications addresses an important aspect of current transaction processing systems: heuristic errors and how they can be dealt with in an interoperable manner.
The problem with failures
Imagine you walk into a bank and want to perform a transaction (banks are very useful things in transaction examples). That transaction involves you transferring money from one account (savings) to another (current. You obviously want this to happen with some kind of guarantee, so for the sake of this example lets assume we use an ACID transaction.
To ensure atomicity between multiple participants, a two phase commit mechanism is required: during the first (preparation) phase, an individual participant must make durable any state changes that occurred during the scope of the atomic transaction, such that these changes can either be rolled back (undone) or committed later once consensus to the transaction outcome has been determined amongst all participants, i.e., any original state must not be lost at this point as the atomic transaction could still roll back.
Assuming no failures occurred during the first phase (in which case all participants will be forced to undo their changes), in the second (commitment) phase participants may 'overwrite' the original state with the state made durable during the first phase.
In order to guarantee atomicity, the two-phase protocol is necessarily blocking. If the coordinator fails, for example, any prepared participants must remain in their prepared state until they hear the transaction outcome from the coordinator. All commercial transaction systems incorporate a failure recovery component that ensures transactions that were active when a failure occurred are eventually completed. However, in order for recovery to occur, machines and processes obviously need to recover! In addition, even if recovery does happen, the time it takes can be arbitrarily long. So, in our bank example, despite the fact that we're using transactions and assuming that the transaction system is reliable, certain failures will always occur, given enough time and probabilities. The kinds of failure were interested in for this example are those that occur after the participants in the two-phase commit transaction have said they will do the work requested of them (transfer the money) i.e., during the second (commit) phase. So, the money has been moved out of the current account (it's really gone) and is being added to the savings account, when the disk hosting the savings account dies, as shown in the diagram. Usually what this means is that we have a non-atomic outcome, or a heuristic outcome : the transaction coordinator has said commit, one participant (savings account) has said 'done", but the second one (current account) has said 'oops!". There's no going back with the work the current account participant has done, so this transaction isn't going to be atomic (all or nothing).
Why is this a Problem?
Most enterprise transaction specifications, such as the OMG's Object Transaction Service and the X/Open XA specification , and implementations (like CICS, Tuxedo and the Arjuna Transaction Service) allow for heuristics to occur. This basically means that the transaction system can be informed (and hence can inform) that such an error has happened. There's not a lot that can be done automatically to fix these types of error. They often require semantic information about the application in order to restore consistency, so have to be handled by a system administrator. However, the important thing is that someone knows there's been a problem,
Imagine that this error happens and you don't know about it! Or at least don't know about it until the next time you check your account. Not good. Personally I'd like to know if there's been an error as soon as possible. In our bank scenario, I can go and talk to someone in the branch. If I was doing this via the internet there's usually a number I can call to talk to someone.
Now why is this important? Well, there are a few Web Services transactions specifications around that can be used in this scenario: WS- AtomicTransaction and WS-ACID Transaction (part of WS-Transaction Management). Only WS-ACIDTransaction provides support for heuristic errors to be sent from participant to coordinator and from coordinator to end-user. This seems like a strange omission to the other specification, because these kinds of errors do happen.
One obvious question is: why don't heuristics appear in the WS- AtomicTransaction specification? Without a definitive answer from the authors we can only speculate. Maybe they don't think heuristics can happen in real life? However, given that heuristics occur in other specifications and implementations, another possibility is that the authors believe that these kinds of faults can and should be dealt with behind the service boundary and not exposed to applications. Unfortunately that's not the case; in most cases heuristic errors can only be resolved with semantic information about the application that they occurred within. Automatic resolution rarely happens and manual resolutions can take arbitrarily long periods-, so knowing that an error has happened (and an error that potentially blows all ACID guarantees away) as quickly as possible, is important to users and applications.
In reality this apparent omission from that specification is not as bad as might first seem. The WS-AtomicTransaction specification does have an error InconsistentinternalState that can be returned by participant to indicate 'it cannot fulfil its obligatons. However, this isn't sufficient, because there isn't enough information given for the recipient (the coordinator) to know what heuristic decision the participant took: did it commit, did it roll back, or did it do something else entirely different? Of course you can use WS-AtomicTransaction to communicate these errors.
Unfortunately you just can't do it within the specification. You would have to overload SOAP faults (for example), or maybe use some proprietary extension (repeat after me: vendor lock-in is not good). Not a good idea for interoperability and/or portability. WS-Atomic Transaction and WS-ACID Transaction are really meant for interoperability of existing transaction service implementations, where heuristics originated. This makes the omission of heuristics in WS-Atomic Transaction even more striking and hopefully something that will be addressed through a standards body. With any luck, the experiences of both sets of specifications ran be leveraged into a single industry standard.
 Gray, Jim and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufman, 1993.
 Distributed Transaction Processing: The XA Specification, The Open Group, February 1992.
|Printer friendly Cite/link Email Feedback|
|Date:||May 1, 2005|
|Previous Article:||Defining null values in Microsoft Access.|
|Next Article:||Comprehensive spyware solution for home users.|