300 Million URLs and the Sitemaps Who Love Them

I have a funny website.

Well, I have a few funny websites. But this one is funny because it actually makes money, and it’s only one page on the backend.

It’s infinity URLs, however.

The site is MathHelper.us, a site that as of right now just teaches how to add, subtract, and multiply fractions. (I’ll add dividing fractions eventually.) But it’s geared for long tail traffic — I want to show how to add every fraction.

Check out 1/5 + 3/8. Or 9/17 plus 78/92. They probably look pretty similar, because they’re full of the same language, but the numbers and steps are different. Yay for dynamic content generation.

The same goes for their images. Those are generated dynamically as well, giving me infinite possible images, created on the fly for whatever math problem the page is handed.

The magic is in a .htaccess file at the root of the site. It has this one important line:

RewriteRule ^(.*)$ /index.php?p=$1 [L]

That means, no matter what URL folks come in through, it gets sent as a parameter to index.php. Index.php then does all the fun parsing and sanitizing, as well as the solution generation.
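Here's roughly what that parsing and sanitizing step could look like. This is a sketch in Python, not the actual index.php. The URL shape (`1/5+3/8`, with `+`, `-`, or `x` as the operator), the three-digit cap, and the names `parse_problem` and `solve` are all assumptions made for illustration.

```python
import re
from fractions import Fraction

# Assumed URL shape: "<num>/<den><op><num>/<den>", e.g. "1/5+3/8".
# The real site's format may differ; this only illustrates the idea.
PROBLEM_RE = re.compile(r"^(\d{1,3})/(\d{1,3})\s*([+\-x])\s*(\d{1,3})/(\d{1,3})$")


def parse_problem(path):
    """Return (left, op, right), or None if the path is not a valid problem."""
    match = PROBLEM_RE.match(path.strip("/ "))
    if match is None:
        return None  # anything that is not digits and one operator is rejected
    a, b, op, c, d = match.groups()
    if int(b) == 0 or int(d) == 0:
        return None  # a zero denominator is not a fraction
    return Fraction(int(a), int(b)), op, Fraction(int(c), int(d))


def solve(left, op, right):
    """Compute the answer as an exact fraction."""
    if op == "+":
        return left + right
    if op == "-":
        return left - right
    return left * right  # the only remaining operator is "x"
```

The whitelist regex does the sanitizing: since only digits, slashes, and one operator can get through, nothing from the URL reaches the rest of the page unchecked.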

Now, when you search for 1/8 + 7/12, I want my search result to be first. (As of this writing, I’m second, after another website and Google’s calculator answer, which currently renders the answer as a decimal. That’s useless if you’re trying to get a fraction answer.)

So, the problem with Google: how do I let them know about, well, a metric gazillion URLs?

First, let’s reduce the number from infinity to, say, 100,000,000. That’s the number of math problems from 1/1 + 1/1 to 100/100 + 100/100. Then include subtraction and multiplication, and we now have 300 million URLs we want to point at. (Again, division will come later.)
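If you want to check my arithmetic, here it is spelled out as a few lines of Python. The variable names are mine, just for illustration.

```python
MAX_VALUE = 100   # numerators and denominators each run from 1 to 100
OPERATIONS = 3    # add, subtract, multiply

fractions_available = MAX_VALUE * MAX_VALUE        # 10,000 ways to write n/d
problems_per_operation = fractions_available ** 2  # 100,000,000 fraction pairs
total_urls = problems_per_operation * OPERATIONS   # 300,000,000

print(f"{total_urls:,}")
```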

300 million is a much more manageable number than infinity. But how to let Google know about them?

Enter sitemaps. A sitemap is a file that contains all the links on your site, so search engines can look in one place and see, ah, there’s everything. It makes crawling your site easier for them, and it lets you give Google links that aren’t linked to from anywhere else on the web.

Sitemaps have some limitations, however: you can only include 50,000 links in a single sitemap. Additionally, a sitemap has to live in the same directory as the links it lists, or in a parent directory of them.

The solution is two-fold:

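The first part is a sitemap of sitemaps. You can make a sitemap that points to other sitemaps, like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <sitemap><loc>...</loc></sitemap> entry per child sitemap goes here -->
</sitemapindex>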

With that, each of the 6000 child sitemaps can contain 50,000 URLs, and the parent sitemap contains a mere 6000 links. Now we have links to every solution from 2/9 + 7/5 to 98/75 x 45/3.
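For the curious, a parent sitemap like that could be generated with something like the sketch below. The child file names (`sitemap1.xml` through `sitemap6000.xml`) and the function name `build_sitemap_index` are assumptions for illustration, not necessarily what I used.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, count):
    """Return sitemap index XML pointing at `count` child sitemaps.

    Child names follow an assumed "sitemap<N>.xml" pattern.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<sitemapindex xmlns="{SITEMAP_NS}">',
    ]
    for n in range(1, count + 1):
        loc = escape(f"{base_url}/sitemap{n}.xml")  # escape &, <, > for XML
        lines.append(f"  <sitemap><loc>{loc}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


# 300,000,000 URLs at 50,000 per child sitemap works out to 6000 children.
index_xml = build_sitemap_index("http://example.com", 300_000_000 // 50_000)
```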

But. BUT. How do I store 6000 sitemaps?

I could generate them dynamically based on the link, like if I called them “sitemap1over1plus1over5to100over100to5over5.txt” – but that’s gross. Easier to just generate the 6000 files, am I right?

But who wants SIX THOUSAND files in their website’s root directory?

Not this guy. I shoved them in /sitemaps.
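Generating the files themselves is a one-time batch job. Here is a minimal sketch of how it might go, assuming the same made-up URL shape (`/1/5+3/8`) and `sitemap<N>.xml` naming; `problem_urls` and `write_sitemaps` are hypothetical names. It streams the URLs through a generator so all 300 million never have to sit in memory at once.

```python
import itertools
from pathlib import Path

URLS_PER_SITEMAP = 50_000
HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')


def problem_urls(base_url, max_value=100, operators=("+", "-", "x")):
    """Yield one URL per problem, lazily (assumed URL shape)."""
    values = range(1, max_value + 1)
    for op in operators:
        for a, b, c, d in itertools.product(values, repeat=4):
            yield f"{base_url}/{a}/{b}{op}{c}/{d}"


def write_sitemaps(urls, out_dir, per_file=URLS_PER_SITEMAP):
    """Write URLs into numbered sitemap files; return how many were written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    urls = iter(urls)
    count = 0
    while True:
        chunk = list(itertools.islice(urls, per_file))  # next 50,000 at most
        if not chunk:
            return count
        count += 1
        # These URLs contain no &, < or >, so no XML escaping is needed here.
        body = "".join(f"  <url><loc>{u}</loc></url>\n" for u in chunk)
        path = out_dir / f"sitemap{count}.xml"
        path.write_text(HEADER + body + "</urlset>\n", encoding="utf-8")
```

Calling `write_sitemaps(problem_urls("http://example.com"), "sitemaps")` would fill the folder with the full set.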

But this violates the “sitemap must reside in parent directory of urls it contains” rule. SO! We come back to .htaccess.

RewriteCond %{REQUEST_URI} !^/sitemaps/
RewriteRule ^sitemap([0-9]+.*)$ /sitemaps/sitemap$1 [L]

That’s the second part. Now, any request outside the sitemaps folder (the first line there) for the word “sitemap” followed by a number is rewritten internally to the same filename in the sitemaps directory. These two lines have to sit above the catch-all rule from earlier, or index.php will grab the request first.

NOTE THIS IS NOT A 301 REDIRECT! That would change the URL that is being fetched, and violate The Rule.
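To make the mapping concrete, here is a small Python imitation of what that rule does to a request path. It is only a model of the rule, not how Apache works internally, and `apply_sitemap_rule` is a name made up for this sketch.

```python
import re

# Same pattern as the RewriteRule. In .htaccess context the leading slash
# is stripped from the path before the pattern is tested.
SITEMAP_RULE = re.compile(r"^sitemap([0-9]+.*)$")


def apply_sitemap_rule(request_path):
    """Return the path actually served for a request (model of the rule)."""
    if request_path.startswith("/sitemaps/"):
        return request_path  # the RewriteCond: already in the folder
    match = SITEMAP_RULE.match(request_path.lstrip("/"))
    if match:
        # Internal rewrite: the client still sees the original URL.
        return "/sitemaps/sitemap" + match.group(1)
    return request_path  # not a numbered sitemap; this rule leaves it alone
```

So a crawler asking for `/sitemap42.xml` is quietly served `/sitemaps/sitemap42.xml`, and as far as it can tell the file lives at the root.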

But now I have 300,000,000 unique URLs pointing to unique math solutions that kids can step through for help with their homework problems.

I’m not sure if all this is going to ding me or help me on Google. I just submitted the parent sitemap two days ago. We’ll see if they decide I’m somehow spamming them or violating some other rule I couldn’t find. If my traffic goes up or down, I’ll let you know.
