Search Engine question


Mister X
07-08-2003, 10:28 AM
I'm just curious whether pages with server-side code in them rank as well as vanilla HTML pages. I know that when a spider hits an index.asp it doesn't see the ASP code, just the HTML after the includes are inserted. BUT does the search engine take into account the fact that it IS ASP and rank it lower?

wsjb78
07-08-2003, 10:32 AM
Well, Google doesn't rank them differently, but Google doesn't like too many parameters attached to a URL... especially not session IDs...

If you're worried, have .html files run as .php or .asp....
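If you go that route, here's a minimal sketch of the Apache side (this assumes mod_php and that .htaccess overrides are allowed; on IIS/ASP it would be a script-mapping change instead):

# .htaccess - run .html and .htm files through the PHP engine
AddType application/x-httpd-php .html .htm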

Cyndalie
07-08-2003, 11:27 AM
You won't rank lower due to a page's file extension; however, you are correct in saying that search engines can only index the static HTML code of your pages.

Mister X
07-10-2003, 12:52 PM
Originally posted by Cyndalie
You won't rank lower due to a page's file extension; however, you are correct in saying that search engines can only index the static HTML code of your pages.

I'll clarify a bit. If my ASP has the following include statement: <!--#include file="footer.html"-->
The page you see in the browser includes the HTML from the footer file, but you never see the ASP include statement itself. I don't use any variables in the URLs, so I'm thinking that the page will qualify as static content. But I just wanted to make absolutely sure, hehehe. Since there is no penalty for the file extension, it seems I should be in the clear. I could really have used SSI and made it a .shtml page, but we have ASP on the server so I figured I might as well use that to save having to change stuff individually on 20 pages, hehe.
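To make the point concrete, here's a rough before/after sketch (the footer contents are made up):

What sits on disk (index.asp):
<p>Welcome to the site...</p>
<!--#include file="footer.html"-->

What the server sends to any client, browser and spider alike:
<p>Welcome to the site...</p>
<p>Copyright 2003 - All rights reserved</p>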

Thanks for your input.

Cyndalie
07-10-2003, 02:13 PM
Just don't <!--#include--> the header content; the search engines need the head tags to be static in order to read the title and metas. Anything else on the page has to be in the static CODE to be read by SEs. SEs cannot spider includes or the content within them yet.

Mister X
07-11-2003, 12:42 AM
Originally posted by Cyndalie
Just don't <!--#include--> the header content; the search engines need the head tags to be static in order to read the title and metas. Anything else on the page has to be in the static CODE to be read by SEs. SEs cannot spider includes or the content within them yet.

Actually, it never even crossed my mind to use an include for the head section, hehehe. I guess I'm still unclear on exactly how the includes work... I'm really not a programmer. I was under the impression that using includes (either SSI or ASP or whatever) was essentially invisible to a search engine, because the server generates the page on the fly (server side) and the spider never actually sees the include statement. I know when I "view source" on a page I never see the include... just the code that was inserted. I guess I'm just not clear on how what a spider sees and what a browser sees could be different in that situation.

At any rate I'm guessing that using includes to pull the footer and the news portion of a page would probably not seriously affect a ranking even if the spider doesn't see it.

wsjb78
07-11-2003, 04:04 AM
Originally posted by Cyndalie
Just don't <!--#include--> the header content; the search engines need the head tags to be static in order to read the title and metas. Anything else on the page has to be in the static CODE to be read by SEs. SEs cannot spider includes or the content within them yet.

I'm not so sure about that... as Mister X pointed out, when you use scripts or includes it's all done on the server, and the output is a quasi-static HTML page...

Cyndalie
07-11-2003, 10:54 AM
The search engines see what code is on your page when you upload it to the server, not what you see when you view source on the page from the web.

If your code is all includes, you have no chance of ranking that site unless you can plug in some static content, either hidden in the code or visible on the actual page.

Trust me, I know - I've been battling it out with a Cold Fusion site that is 100% includes and HTML content plugins, where the only pages I can get ranked are the static promotional pages. What you see from the web has no bearing on what the SE sees. An example is http://SmutDrs.com - every piece of code on that site is from an include. Not one bit of static content. Can you tell? Only the webmaster truly knows.

SEs cannot spider or index included content yet. They are just now becoming able to spider and index Flash movies, however :) Google can follow the links, and AllTheWeb.com can index links and the actual text in the Flash movie :)

wsjb78
07-11-2003, 11:55 AM
ColdFusion may be different from PHP and ASP....
In PHP, when you call a file it gets parsed by the PHP engine and only the HTML output is sent to the external caller... I'll just make an all-include PHP file myself and then I'll see...
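For what it's worth, a minimal sketch of that test page (the include file names are made up - any three fragments would do):

<?php
// test.php - every byte of output comes from includes;
// a client only ever receives the assembled HTML.
include 'header.php';   // e.g. <html><head><title>Test</title></head><body>
include 'content.php';  // e.g. the actual body text
include 'footer.php';   // e.g. </body></html>
?>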

Well, Google did index that:

http://216.239.53.104/search?q=cache:pOZK235p0TsJ:www.smutdrs.com/+allinurl:smutdrs.com&hl=en&ie=UTF-8

http://216.239.53.104/search?q=cache:4lw2y0p406YJ:www.smutdrs.com/index.cfm%3Ffuseaction%3DDearJohn+allinurl:smutdrs.com&hl=en&ie=UTF-8

NetRodent
07-11-2003, 12:41 PM
Originally posted by Cyndalie
The search engines see what code is on your page when you upload it to the server, not what you see when you view source on the page from the web.

100% wrong. A search engine sees exactly what you see when you view the source of your page. There's no way for a spider to see the "uploaded" version of the page. A search engine's spider makes the same type of HTTP request as a browser, so it sees exactly the same thing.

Originally posted by Cyndalie
If your code is all includes, you have no chance of ranking that site unless you can plug in some static content, either hidden in the code or visible on the actual page.

Do NOT try to hide text. Hidden text is just about the worst offense you can commit in the eyes of a search engine. Search engines want to deliver surfers to a page that is relevant for the term they searched for. As such, they want to index a page based on what the surfer will see, not based on the words the webmaster wants to get traffic from. Let me repeat one more time: DO NOT HIDE TEXT.

Originally posted by Cyndalie
Trust me, I know - I've been battling it out with a Cold Fusion site that is 100% includes and HTML content plugins, where the only pages I can get ranked are the static promotional pages. What you see from the web has no bearing on what the SE sees. An example is http://SmutDrs.com - every piece of code on that site is from an include. Not one bit of static content. Can you tell? Only the webmaster truly knows.

Taking a quick look at your page, I don't think your problem is includes. Most of the pages I looked at seemed to have very little actual text on them. The pages that do have text tend to have the most spider-friendly parts at the end of the page, after lots of JavaScript and page-layout junk. Your title and meta tags also appear a bit "stuffed" and are the same on every page.

Originally posted by Cyndalie
SEs cannot spider or index included content yet. They are just now becoming able to spider and index Flash movies, however :) Google can follow the links, and AllTheWeb.com can index links and the actual text in the Flash movie :)

Again, includes are not a problem and never were. A few years back search engines tended to avoid pages that looked dynamic (i.e. had query strings or dynamic extensions), but that was mainly due to a fear of the spider getting stuck in a recursive loop, not from any inability.

Originally posted by Cyndalie
Just don't <!--#include--> the header content; the search engines need the head tags to be static in order to read the title and metas. Anything else on the page has to be in the static CODE to be read by SEs. SEs cannot spider includes or the content within them yet.

There are no special requirements for the header of a page as compared to the body. There is no problem using a dynamic header as long as it's done server side.
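For example, here's a rough sketch in PHP (the file and variable names are made up) - the head is assembled on the server, so a spider still receives a perfectly ordinary title and meta tag:

<?php
// page.php - the head section is generated server side
$title = 'Widget Reviews - Example Site';           // hypothetical per-page values
$description = 'Independent reviews of widgets.';
?>
<html>
<head>
<title><?php echo $title; ?></title>
<meta name="description" content="<?php echo $description; ?>">
</head>
<body>
<p>...page content...</p>
</body>
</html>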

If you want to learn more about search engine optimization, check out:
http://www.searchenginewatch.com
http://www.webmasterworld.com
http://www.searchengineforums.com

One last time, do NOT hide text. It's like begging to be banned.

Mister X
07-11-2003, 03:05 PM
Hmmm... seems to be a bit of a disagreement on that... lol. I'm pretty much in the camp that says the spider can't see anything that the server doesn't send to it. So if it asks for a page with includes, the server is going to put the content into the page and THEN send it to the spider. Basically there are three ways to access a file on a webserver: HTTP, FTP and SSH/Telnet. FTP and SSH/Telnet require a username and password in most cases, and HTTP doesn't see any difference between Joe Surfer and Moe Spider. And an SE spider doesn't use FTP or other protocols, because it is spidering WEB content. If a spider could see what is actually on the server, as opposed to what is served over HTTP, there would be a HUGE security hole. That would mean that someone could write a script to emulate a spider and download all the .htaccess and other files on a server.

Mister X
07-11-2003, 03:15 PM
Just for the hell of it I did the following search on Google:
http://www.google.ca/search?q=eromodel+cash+lanny+barbie&ie=UTF-8&oe=UTF-8&hl=en&meta=

There is one result which is: http://www.prettyteenmovies.com/EromodelCash/Index.asp

The quoted text is: ... 06-01-03 Eromodel Cash We are pleased and proud to announce that we have signed
June 2003 Penthouse Pet Lanny Barbie to be a part of our online family! ...

That text comes from the news scroller. And it is called via: <!--#include file="news.htm"-->

The text isn't visible anyplace else on the page.

So it seems spiders certainly do NOT have any problems with server side includes. At least not those in asp.

Cyndalie
07-11-2003, 03:46 PM
Differences of opinion are great. But you cannot say I'm wrong on this point:

"100% wrong. A search engine sees exactly what you see when you view the source of your page. There's no way for a spider to see the "uploaded" version of the page. A search engine's spider makes the same type of HTTP request as a browser, so it sees exactly the same thing. "

You misunderstand. A search engine sees what you see when you view source on a page when you are looking at it from the server, not from the web.

"I don't think your problem is includes. "

View source of http://SmutDrs.com - every piece of code on that site is from an include; there is 0 static content on the page, including the head and foot tags. I know it is includes and CFM plugins, I have seen the source. In order to optimize the head tags they had to optimize the content in the included file, so although you can see it from the web, when you look at the page on the server, all you see is the include tags, and NOT the actual head code content. The reason why they are the same on every page is because IT'S PULLING FROM THE SAME INCLUDE HEADER FILE! If it seems stuffed it's because they tried to manually integrate the metas on top of the includes. The only pages they can get this site to rank are doorways and non-dynamic pages. Cold Fusion has a lot more problems than PHP and ASP, for sure!

I did not suggest abusing hidden code, but rather integrating indexable content relevant to your site into your indexable code as best as possible. Ideally you do not want to use includes to manage and supply textual content.

My testing over the past 3 months has shown that SE's are still unable to spider includes as content for the source page they are indexing for ranking.

Includes can be useful when used for graphical content; this lightens the source code so your actual text content stands out, which can help your rankings... However, with SmutDrs.com there is NO static text or HTML on the site whatsoever, no matter how it appears when viewed from the web. Many sites use both; just optimize the static HTML content around your includes and you should do great!

Appreciate your input NetRodent, however I have been doing full-time SEO for the past 5 years and have worked with all kinds of sites, designs, and programming languages. I would not mislead someone or suggest something that I did not learn from experience.

A difference of opinion is always welcome :D

Best wishes,
Cyndalie

wsjb78
07-11-2003, 03:53 PM
Originally posted by Cyndalie
View source of http://SmutDrs.com - every piece of code on that site is from an include; there is 0 static content on the page, including the head and foot tags. I know it is includes and CFM plugins, I have seen the source. In order to optimize the head tags I had to optimize the content in the included file, so although you can see it from the web, when you look at the page on the server, all you see is the include tag, and NOT the actual head code content. Trust me, I know what I'm talking about. The reason why they are the same on every page is because IT'S PULLING FROM THE SAME INCLUDE HEADER FILE!!!

How do you look at the file on the server? There's the key misunderstanding. As I have posted before, the Google spider sees what the server sends, in terms of source code.... You can look at the cached pages of SmutDrs....

On the server you can even have Lynx call a PHP or ASP or CFM page, dump it to a temp file, and then view the temp file. All you see in there is static HTML and no includes....

As soon as there is an HTTP request and the HTTP server is configured to parse files, then they will be parsed....
As pointed out by MisterX before, if you don't do an HTTP request (e.g. FTP or SSH) then you will get the original file with all includes... even wget will result in a "static" html page.
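If anyone wants to try it, two quick ways to grab the parsed output from the command line (assuming lynx and wget are installed; the URL is just an example):

lynx -source http://www.example.com/index.php > parsed.html
wget -O parsed.html http://www.example.com/index.php

Open parsed.html in a text editor and all you'll find is plain HTML, no include statements.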

Mister X
07-11-2003, 04:49 PM
I know that you can use browser-side includes. That's done quite often with JavaScript, and in that case it's the browser that puts things together and not the server. Certainly in that case a spider wouldn't be getting all the code, because it isn't set up the same as a browser. But there are pretty big differences between browser and server includes.

Mister X
07-11-2003, 05:01 PM
Originally posted by Cyndalie
In order to optimize the head tags they had to optimize the content in the included file.

Well, that seems pretty obvious, really. Includes or not, what you are doing is building a page, so you have to optimise all the parts of the page. What I don't really understand is why they felt it was a good idea to use includes to pull in the actual head section. Some kind of bizarre logic must be at work there.



My testing over the past 3 months has shown that SE's are still unable to spider includes as content for the source page they are indexing for ranking.

With respect, I proved that wrong in my prior post. If a spider couldn't access includes, the page in my example would NOT be brought up by Google in that query, for the simple reason that "Lanny Barbie" does not appear anywhere in the source code for that page and never has. It is ONLY in the file referenced by the include. That file isn't web accessible at all unless you happen to know the filename and path, because it isn't linked from anywhere.

Cyndalie
07-11-2003, 05:11 PM
"How do you look on the server at the file? " First you download the file to your drive (via FTP) and view it in a notepad.

Ok this is kinda hard to explain..

Have you ever viewed a .cfm or whatever file in a text editor BEFORE it is uploaded to the server or viewed from the web? Cold Fusion is the worst with this... In a text editor, not an HTML editor.

I can't type the exact code here or it can mess up the board, but the actual page source of SmutDrs looks like this:

<cfinclude template="app_globals.cfm">

from head to foot. They had to manually put in metas at the top of the page, even though one of the template includes plugs them in - there aren't even opening and closing html tags in the actual code of the page. Because it was all in includes and engines read the HTML code, if they had not done it manually the engines would have had nothing to read or index the site on.

BUT when you view it from the web, all content is plugged in, so what you see is NOT necessarily what you have to work with when optimizing dynamic sites. Since engines are robots and not humans calling a page, they can read what the programming code looks like - this used to be applied to cloaking several years ago. Show the engine one thing and the user another - but that was by IP; here it's because dynamic content reacts when it's CALLED, not necessarily read. This is why when the code is cached and viewed in a browser, it even looks normal as well.

Most sites consist of both HTML and whatever dynamic programming language, and they can actually work FOR you. However, if you have a say when working with a programmer, make sure they are not making 100% of your site dynamically generated or it will become optimization hell. I haven't run across this much, and I look for it before committing to a new site for optimization.


WSJB78 you had some great points I'm going to delve further into. I know I'm not 100% right, but this is what I have seen and worked with and it can be frustrating, since every language and site is different.

Cyndalie
07-11-2003, 05:15 PM
MisterX you're exactly right, great application for that purpose! Thanks for the input :)

"What I don't really understand is why they felt it was a good idea to use includes to pull the actual head section. Some kind of bizarre logic must be at work there." I wondered that myself. They said something like they built it so they can update the site via an interface rather than hire a webmaster. Paying the price now I'm afriad....

Cyndalie
07-11-2003, 05:23 PM
Small Recap:

To better understand includes I often think of them as iframes. I think of them as pulling content OFF your page rather than plugging it in LOL

A page does not have to have a dynamic URL to have dynamic code.

Includes used for graphical purposes such as headers and menus can be beneficial, since they make the actual text on the page more visible to the search engine, reducing the lines of code it has to index and improving your chances of a full spidering. It keeps your indexable code CLEAN. However, if you INCLUDE your navigation, make sure you use footer text links or have a sitemap link from every page that links all pages together.

NetRodent
07-11-2003, 06:05 PM
Originally posted by Cyndalie
Differences of opinion are great. But you cannot say I'm wrong on this point:

"100% wrong. A search engine sees exactly what you see when you view the source of your page. There's no way for a spider to see the "uploaded" version of the page. A search engine's spider makes the same type of HTTP request as a browser, so it sees exactly the same thing. "

You misunderstand. A search engine sees what you see when you view source on a page when you are looking at it from the server, not from the web.

Perhaps we aren't quite speaking the same language. If by "looking at it from the server" you mean seeing the raw HTML (same as "view source"), then our disagreement is just over language. However, if you are implying that a search engine can see what a page looks like before it is served (i.e. <!--#element attribute=value attribute=value -->), I'm going to have to continue to disagree with you. It's just not possible; includes are parsed by the server before they are delivered to the client.

Originally posted by Cyndalie
"I don't think your problem is includes. "

View source of http://SmutDrs.com - every piece of code on that site is from an include; there is 0 static content on the page, including the head and foot tags. I know it is includes and CFM plugins, I have seen the source. In order to optimize the head tags they had to optimize the content in the included file, so although you can see it from the web, when you look at the page on the server, all you see is the include tags, and NOT the actual head code content. The reason why they are the same on every page is because IT'S PULLING FROM THE SAME INCLUDE HEADER FILE! If it seems stuffed it's because they tried to manually integrate the metas on top of the includes. The only pages they can get this site to rank are doorways and non-dynamic pages. Cold Fusion has a lot more problems than PHP and ASP, for sure!

I did view the source of SmutDrs.com, and what I saw is exactly what any search engine spider would see (unless you're cloaking, but that's a completely different discussion). I'll take your word that the page is made up of includes and cold fusion plugins, but just by looking at the source, I couldn't tell and neither could a spider.

I brought up your title and meta tags, not to discuss how they were included in the final html document but to suggest that possibly you weren't ranking because you repeat the same words over and over and it looks spammy. There is no technical or moral reason for a search engine not to rank a page made with cold fusion. However, there is a very good reason not to rank a page that has spammy headers and little body text.

Originally posted by Cyndalie
I did not suggest abusing hidden code, but rather integrating indexable content relevant to your site into your indexable code as best as possible. Ideally you do not want to use includes to manage and supply textual content.

I really don't see any technical reason not to use includes (aside from issues of server load). Most of my search engine pages are entirely dynamic (aside from the layout and most of the graphics) and they are indexed and ranked no differently than static pages. As far as the search engine is concerned, they are static pages (even down to the .html extension).

Originally posted by Cyndalie
My testing over the past 3 months has shown that SE's are still unable to spider includes as content for the source page they are indexing for ranking.

That's very interesting, but I don't see how it's possible. A spider cannot know what is an include and what isn't. The document is assembled on the server before the spider can see it. I'll gladly eat my hat if you can show me how a spider could tell the difference. Can you point to any third-party experiments that support your assertion?

Originally posted by Cyndalie
Includes can be useful when used for graphical content; this lightens the source code so your actual text content stands out, which can help your rankings... However, with SmutDrs.com there is NO static text or HTML on the site whatsoever, no matter how it appears when viewed from the web. Many sites use both; just optimize the static HTML content around your includes and you should do great!

Includes are handled before a page is served to the client, so they can't "lighten the source code" as far as a spider is concerned. Spiders see exactly the same thing a browser sees. They issue the exact same HTTP requests. Search engines do not have a "magic" way of seeing anything other than what the web server sends to them.

Look through your log files and compare the byte count of a page served to a spider with the byte count of a page served to a regular user. Unless your includes spit out different results depending on the time of day, the IP address or user agent of the client, or some other random element, the byte counts will be exactly the same.

Originally posted by Cyndalie
Appreciate your input NetRodent, however I have been doing full-time SEO for the past 5 years and have worked with all kinds of sites, designs, and programming languages. I would not mislead someone or suggest something that I did not learn from experience.

We've been in the search engine game for about the same amount of time. I don't think you would intentionally try to mislead someone; however, you are saying things that don't match my experiences, what I have read of other people's experiences, and things that are just technically not possible. Just because you can't get a page that uses includes ranked doesn't mean the includes are the reason.

NetRodent
07-11-2003, 06:29 PM
Originally posted by Cyndalie
BUT when you view it from the web, all content is plugged in, so what you see is NOT necessarily what you have to work with when optimizing dynamic sites. Since engines are robots and not humans calling a page, they can read what the programming code looks like - this used to be applied to cloaking several years ago. Show the engine one thing and the user another - but that was by IP; here it's because dynamic content reacts when it's CALLED, not necessarily read. This is why when the code is cached and viewed in a browser, it even looks normal as well.

Cold fusion is a server side scripting language. The server doesn't know whether the client is human or a robot. A search engine spider issues basically the same HTTP request that a human via a browser would issue. There is no special backdoor for robots.

If you want to try it for yourself, open a plain connection to your webserver on port 80 and issue the HTTP commands by hand. Just make sure you hit enter twice after the last line:

GET / HTTP/1.1
Host: www.smutdrs.com
User-Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

You'll see exactly the same thing a spider will see. If you want to play around more with direct connections, read up on the HTTP specifications:

http://www.faqs.org/rfcs/rfc2616.html
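If you've never done it by hand, one way to open that plain connection is an ordinary telnet client (an assumption on my part - any raw TCP client will do):

telnet www.smutdrs.com 80

Then type the request above and hit Enter twice.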

Originally posted by Cyndalie
Most sites consist of both HTML and whatever dynamic programming language, and they can actually work FOR you. However, if you have a say when working with a programmer, make sure they are not making 100% of your site dynamically generated or it will become optimization hell. I haven't run across this much, and I look for it before committing to a new site for optimization.

At the same time, if you are the programmer (or you can get him to understand what is important) a fully dynamic site can be an optimization dream. For example, you could adjust the keyword weighting across 100K pages by changing 1 line in one file.
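As a rough sketch of how that works (PHP here, with made-up file and function names): every generated page pulls its title from one shared file, so editing that one line changes every page at once.

<?php
// title_builder.php - a hypothetical file included by every generated page
function build_title($keyword, $city) {
    // edit this one line and the title of every generated page changes with it
    return "$keyword in $city - Example Directory";
}
?>

Each page would then just do: <title><?php echo build_title($keyword, $city); ?></title>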

wsjb78
07-12-2003, 04:18 AM
NetRodent:

I completely agree with you... dynamic pages are not worse for SEs as long as you don't have too many parameters on them. "New" approaches are to turn those parameters into a "directory" path with the rewrite function (mod_rewrite). However, that does slow down the overall performance...
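For the record, a minimal sketch of that kind of rewrite (Apache mod_rewrite; the URL pattern and script name are made up):

# .htaccess - map /gallery/123.html onto the real dynamic script
RewriteEngine On
RewriteRule ^gallery/([0-9]+)\.html$ /gallery.php?id=$1 [L]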

NetRodent
07-12-2003, 10:09 AM
Originally posted by wsjb78
"New" approaches are to turn those parameters into "directory" path with the ReWrite function. However that does slow down the overall performance...

You can also turn parameters into a directory structure by using the ScriptAlias directive. Instead of ScriptAliasing a directory to a directory, you point a directory to a script:

ScriptAlias /dynamic/ /web/dynamic/handler.pl

Then a request such as:
http://www.domain.com/dynamic/param1/param2/param3/param4.html

Would call the script /web/dynamic/handler.pl with the PATH_INFO environment variable set to /param1/param2/param3/param4.html