Python programming newbie question

Project update.

The Chrome tip is awesome, mostly because it lets me visualize where I am and where I want to go in the document. The page I’m working with is one big table with two rows and two columns, and the content I want is in the cell in the second row, second column.

So this is the selector from Chrome:

body > center > table > tbody > tr:nth-child(2) > td:nth-child(2)

Here is what you do to get there in BeautifulSoup:

soup.body.center.table.tr.next_sibling.next_sibling.td.next_sibling.next_sibling

What happened to tbody?
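
A likely answer to the tbody question: Chrome inserts a <tbody> into the DOM whenever a table's rows are not already wrapped in one, so the selector it shows can include an element that is not in the page source. BeautifulSoup with the built-in html.parser builds the tree from the source as written, so there is no tbody to step through (html5lib follows browser rules and would add one). The doubled next_sibling is needed because the whitespace between tags counts as a sibling too. Here is a rough sketch on a made-up two-by-two table; the HTML is a stand-in, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a 2x2 table with no <tbody> in the source.
html = """<html><body><center><table>
<tr>
<td>a</td>
<td>b</td>
</tr>
<tr>
<td><div> Title </div></td>
<td><p>First paragraph.</p><p>Second paragraph.</p></td>
</tr>
</table></center></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# html.parser does not invent a <tbody>, so there is nothing to find.
print(soup.find("tbody"))  # None

# The immediate sibling of the first row is the newline between the rows,
# which is why the chain needs next_sibling twice to reach the second <tr>.
first_row = soup.body.center.table.tr
print(repr(first_row.next_sibling))  # '\n'

# Chrome's selector with the tbody step removed; CSS selectors only look at
# elements, so the whitespace nodes do not get in the way.
cell = soup.select_one("body > center > table > tr:nth-child(2) > td:nth-child(2)")

# The same cell, reached the long way.
chain = soup.body.center.table.tr.next_sibling.next_sibling.td.next_sibling.next_sibling
print(cell == chain)  # True
```

If the real pages are messier than this, the selector may need adjusting, but dropping tbody from Chrome's version is a reasonable first thing to try.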

I’ve had some success selecting some stuff I need, mostly by finding a pointer to it, and then just assigning that pointer to a variable. For example, here is what I did to get the title:

title = soup.body.center.table.tr.next_sibling.next_sibling.div.text.lstrip()

But I can’t figure out how to extract the story in one fell swoop. If I use the path above (soup.body.center.table.tr.next_sibling.next_sibling.td.next_sibling.next_sibling) to load the cell into a variable, for some reason I can’t select the td element itself. The first selectable item is the first <p> tag inside the <td> tag, and I have no idea why. It got late and I had to go to bed. If I can’t figure out how to get rid of the td tags, I guess I can try selecting all the descendants of something instead. Or maybe a loop that gets all the p tags?
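
Both ideas can be sketched in a few lines. This assumes the cell is already in a variable (called cell here, a made-up name) and uses a tiny invented snippet in place of the real page. decode_contents() returns what is inside a tag without the tag itself, and find_all("p") is the loop over the p tags:

```python
from bs4 import BeautifulSoup

# Hypothetical cell contents standing in for the real story.
html = "<table><tr><td><p>First paragraph.</p>\n<p>Second <b>paragraph</b>.</p></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Option 1: everything inside the cell as HTML, minus the <td> wrapper.
inner_html = cell.decode_contents()

# Option 2: keep only the <p> tags, each serialized back to HTML.
story_html = "\n".join(str(p) for p in cell.find_all("p"))

# Plain-text version, one blank line between paragraphs.
story_text = "\n\n".join(p.get_text().strip() for p in cell.find_all("p"))

print(story_html)
print(story_text)
```

Option 2 throws away anything in the cell that is not inside a p tag, which may or may not be what you want for the older issues.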

I was able to get a story from the old site, load it into an XML file, and import it into WordPress.

Woo!

As cool as all that sounds, I don’t think I’m going to come up with a fully automated solution. The pages were created over an eleven-year span and there really isn’t a good way to predict how any given page was built. For example, there is no guarantee that a selection that works on an issue from 2008 will work on one from 2006, and so on. Also, to do its magic, BeautifulSoup adds closing tags to single tags like <br>, which means the HTML can end up with big blocks of white space. I could figure out how to delete the closing tags, but then I’d need some sort of wizardry to make sure I’m not deleting anything necessary. I think the thing to do is just grab all the p tags and import them as the story. I will have to do a little post-processing by hand, but the bulk of the process will be automated and the manual work will just be a little cleanup.

Now I have to figure out how to grab the tags from the spreadsheet. Once I figure that out, I will have my proof of concept done and I can tell the editor what I’m up to.

I’m soooo excited!

CSV is a popular choice.

Gendal, is that a library or do you mean the file format?

I was going to start with the library rhamorim posted:

I was talking about the file format. xlrd looks better in every way if you are starting with an Excel spreadsheet. Otherwise, export to CSV and take your pick of the 5 billion libraries available.
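
If you do go the CSV route, the csv module in the standard library is probably enough. A minimal sketch, assuming the spreadsheet is exported with a header row and one row per story; the column names (url, tags) and the semicolon-separated tags are made up for the example:

```python
import csv
import io

# Hypothetical export; in practice this would be open("stories.csv", newline="").
sample = io.StringIO(
    "url,tags\n"
    "issue1/story1.html,news; campus\n"
    "issue1/story2.html,sports\n"
)

# Map each story to its list of tags, dropping stray whitespace and empties.
tags_by_url = {}
for row in csv.DictReader(sample):
    tags_by_url[row["url"]] = [t.strip() for t in row["tags"].split(";") if t.strip()]

print(tags_by_url)
```

DictReader uses the header row as keys, so the code keeps working if the columns get reordered in the spreadsheet.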