String Method Manipulation

ASSIGNMENT:

In this lab, we’re going to explore more about string manipulation, using string methods, slicing and editing web pages. You can remind yourself about string methods here: https://docs.python.org/release/2.5.2/lib/string-methods.html

(1) Download the file footy.txt. Use setMediaPath() to select the folder where you saved it. Write a function that reads the contents of the file as a string and prints out the number of times the word football occurs. Be careful: what kinds of things do you have to account for, and are there string methods that can help you?

First, here’s a function that opens a file and returns the contents of that file as a string:

def getData(file):
    myFile = open(getMediaPath(file), 'r')
    text = myFile.read()
    myFile.close()
    return text

This function opens the file for reading (that’s what that ‘r’ means). It then reads in all the text as one big string. Use this function in another function that counts the number of times the word football appears. THINK about this. Are there methods that can help you?
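As a hint at the kinds of issues involved, here is a minimal sketch on a made-up sentence (not the contents of footy.txt) and a different word. The function name countWord is hypothetical. It shows that count is case-sensitive, and that lowercasing the text first changes the answer.

```python
def countWord(text, word):
    # Lowercase both strings so "Cat" and "cat" are treated as the same word.
    return text.lower().count(word.lower())

sample = "Cat, cat, CAT! The catalog lists one more cat."
print(sample.count("cat"))       # case-sensitive count
print(countWord(sample, "cat"))  # case-insensitive count
```

Notice that count also matches the letters inside a longer word (here, "catalog"). Decide whether that should be included in your total.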

(2) Download the file parasites.txt. This is a file containing a list of nucleotides associated with common parasites. Look at the file in a text editor, and see what its structure is! You should find that there is a line starting with a ‘>’ character, followed by the NAME of a particular parasite (such as Schisto unique AA825099). Then comes a new line, where the nucleotide sequence of the parasite starts.

Let’s imagine we want to find the NAME of a parasite containing a particular nucleotide subsequence (such as ‘ttgtgta’).

HINT: READ ALL THE SUBPROBLEMS BELOW BEFORE STARTING.

(a) Write a Python function, called findSequence, that takes a single argument (a small subsequence, such as ‘ttgtgta’), and which finds subsequences of nucleotides in files like this. This function will open the parasites.txt file, read the file as a single string, and search that string for the first instance of the subsequence parameter.

PRINT OUT the index where you find the subsequence. If the sequence is NOT found, print “SUBSEQUENCE not found”.

HINT: Remember what the find method returns if something is not found? If it IS found it returns the INDEX where the sequence starts.
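A minimal sketch of how the return value of find can be tested, using a short made-up sequence rather than the real file contents:

```python
# A made-up sequence for illustration; it does not come from parasites.txt.
sequence = "acgtttgtgtacc"

index = sequence.find("ttgtgta")
if index == -1:
    # find returns -1 when the substring does not occur anywhere.
    print("SUBSEQUENCE not found")
else:
    # Otherwise it returns the index where the first match starts.
    print(index)
```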

(b) Print out the NAME of the parasite in which you find the subsequence. The name occurs at the beginning of each sequence, following a ‘>’ symbol, and ending with a newline character ‘\n’. You can’t SEE the ‘\n’, but it’s there at the end of every line in this file.

These \n characters act as if you had pressed the return key. That means that whenever there is a new line, there is a hidden \n. We can use this information to help us find the name of the parasite (because we can ‘look’ for them). So, HOW are you going to find the name of the parasite that contains a particular subsequence?

HINT: WORK out how you will solve this on a piece of paper. If you KNOW where a subsequence is, HOW do you find the name associated with that subsequence? You can use string methods to achieve this. IF you do not work this out on paper first, I will be tempted not to help you.

HINT: Normally the method you use to find something in a string finds the FIRST occurrence of the letter, or sequence, in a string. Try:

'foxfox'.find('x') to remind yourself.

There is another method that finds the LAST occurrence of a letter or subsequence in a string. It’s called rfind. Try

'foxfox'.rfind('x')

to see how it works.
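For reference, the two calls can be compared side by side (the variable names here are only for illustration):

```python
word = "foxfox"

first = word.find("x")     # index of the FIRST "x"
last = word.rfind("x")     # index of the LAST "x"
missing = word.rfind("z")  # rfind, like find, returns -1 when nothing matches

print(first)    # 2
print(last)     # 5
print(missing)  # -1
```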

HINT: rfind also takes optional parameters. HOW can these help you find what you want? Again, make sure you figure out on paper, remembering that you can ‘find’ different points (index values) in your string. How can you use these index values to help you? For example, maybe FIRST you find something, then using that index, you can search from there to find something else.
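The optional parameters are a start index and an end index that limit where the search happens. A small sketch on an unrelated, made-up sentence shows the idea of finding one thing first and then searching relative to that position:

```python
sentence = "one, two, three, four"

# First, locate the thing we care about.
where = sentence.find("three")

# Search BACKWARDS from that point: the last comma between index 0 and where.
commaBefore = sentence.rfind(",", 0, where)

# Search FORWARDS from that point: the first comma at or after where.
commaAfter = sentence.find(",", where)

print(where)        # 10
print(commaBefore)  # 8
print(commaAfter)   # 15
```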

HINT: You can extract the name of the parasite using slicing. What does slicing NEED to work? How can you figure out these things ‘automatically’?

HINT: Remember what the find string method returns if something is not found? If it IS found it returns the INDEX where the sequence starts. How could that help you?

HINT: Once you’ve found a sequence, HOW DO YOU KNOW what the name of the parasite containing that sequence is? Look at the file, and figure out how YOU know this. Then figure out what the computer would have to do to figure it out. Imagine that the file contains THOUSANDS of parasites, so do not hard code any of your answers based around the fact that there are only 3 parasites in this example.

HINT: Remember, you can use index positions as optional parameters to the find methods. For example, where to start the search, and where to stop. How can you use the position of the subsequence you’ve found? Look at your notes to remind yourselves of the string methods.

HINT: You can extract the name of the parasite using slicing. In order to slice something from a string, what do you need to know?
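Slicing needs two things: a start index and an end index. One possible approach, sketched on made-up data in the same layout as parasites.txt, is shown here. The names and sequences are invented, and nameFor is a hypothetical function name. Compare it with your own paper solution rather than treating it as the only way.

```python
# Invented data: each record is a ">" line with a name, then sequence lines.
data = ">Alpha sample 001\nacgtacgt\nttgacc\n>Beta sample 002\nggcatt\nttgtgta\n"

def nameFor(data, subsequence):
    where = data.find(subsequence)
    if where == -1:
        return None  # subsequence does not occur in the data
    # The name starts just after the last ">" that comes before the match...
    start = data.rfind(">", 0, where) + 1
    # ...and ends at the first newline after that ">".
    end = data.find("\n", start)
    return data[start:end]

print(nameFor(data, "ttgtgta"))
print(nameFor(data, "ttgacc"))
```

Because the file is searched as one string, a subsequence that is split across two lines has a ‘\n’ in the middle and will not be found by this sketch.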

(3) The Atlanta Journal-Constitution newspaper has a weather page at: http://www.ajc.com/weather/30301/

You’ll find a copy of this page in your media folder (called ajc-weather.html). Note its name. Open this file in TextEdit (right click on the file name, and select Open With…) AND in a web browser, and find where it has the temperature (you might want to search for the phrase “Today in Atlanta”).

Create a program to:

– READ in this file as one string

We are going to want to slice the temperature out of the document. So we need to be able to find index values.

– SEARCH the string for the temperature. THINK: What is a reliable indicator of where the temperature will be? It’s NOT the temperature itself, which obviously can change. Is there something which always follows a temperature? Some special symbol? HOW is that represented in the HTML text?

– If you find the thing AFTER the temperature, how can you use that index position to locate something BEFORE the temperature? Again, write this out on paper.

– Slice out the temperature and print it to the screen.

Make sure you explore the HTML to try to find good clues for where the current weather information is held.
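The steps above can be sketched on a tiny made-up piece of HTML. This is NOT the real contents of ajc-weather.html, and it assumes the degree symbol is written as the HTML entity &deg; and that the temperature sits directly after a closing ‘>’. Check the actual file to see whether either assumption holds.

```python
# Invented HTML for illustration only.
html = '<p>Today in Atlanta</p><span class="temp">72&deg;F</span>'

# Find the marker that comes AFTER the temperature.
after = html.find("&deg;")

# Search backwards from there for the ">" that comes BEFORE the temperature.
before = html.rfind(">", 0, after) + 1

# Slice out everything between the two positions.
temperature = html[before:after]
print(temperature)  # 72
```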

(4) Python is really cool, because it includes libraries that you can use to do some stuff for you. There is a library you can use to help you access LIVE web pages. That library is called urllib.

To use it, at the top of your python code (right at the top of the .py file), you write:

import urllib

urllib includes methods that mean you can grab stuff from real web pages. One method is called urlopen. It acts JUST like the open function for normal text – opening the webpage so you can read the text.

So to open a webpage from the internet (rather than downloading the file), read the contents, and then close that page, we do:

connection = urllib.urlopen("http://www.ajc.com/weather/30301/")

weatherData = connection.read()

connection.close()

Then the variable weatherData contains the same string information as if you had read the information from a file.
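The code above is written for Python 2, which this lab uses. In Python 3, urlopen lives in urllib.request and read() returns bytes rather than a string. A minimal sketch that works in either version is shown here. The helper name fetchPage is hypothetical, and the UTF-8 decoding is an assumption about the page’s encoding.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2, as used in this lab

def fetchPage(url):
    # fetchPage is a hypothetical helper name, not part of urllib.
    connection = urlopen(url)
    try:
        raw = connection.read()
    finally:
        # Close the connection even if reading fails.
        connection.close()
    # Assumes the page is UTF-8; undecodable bytes are replaced, not fatal.
    return raw.decode("utf-8", "replace")
```

Calling fetchPage("http://www.ajc.com/weather/30301/") would then play the same role as the three lines above.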

(a) Create a new program which uses this method to read the current temperature from the LIVE Atlanta Journal-Constitution website. It’s a little different because the format of the web page has changed. Ask me how to view the source code of the actual website, but NOW you should search for the phrase ‘Feels like’.
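A sketch of searching FORWARD from a known phrase, again on made-up HTML. The real page’s layout will differ, and the &deg; marker is an assumption, so inspect the live source first.

```python
# Invented page text for illustration only.
page = '<div>Now 75&deg;</div><div>Feels like 78&deg;</div>'

label = "Feels like"
labelAt = page.find(label)

# Start just past the phrase, then look for the first degree marker after it.
start = labelAt + len(label)
end = page.find("&deg;", start)

# strip() removes the space left between the phrase and the number.
feelsLike = page[start:end].strip()
print(feelsLike)  # 78
```

Passing start as the second argument to find is what skips the earlier &deg; that belongs to the other temperature.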

Your code might take a little while to run, because it has to go to the website, and download the file, and then work on it.
