Final Exam of Big Data
For Q1 to Q9, you must use the generic RDD rather than the Spark DataFrame or SQL API.
Q1: Create an RDD containing 3 documents using the following list (5)
doc = ['read book and read hard copy', 'book is hard to read', 'hard wood is hard to cut']
Then print the collected data from this RDD.
Q2: Count the total number of words in the RDD above (5)
Q3: Obtain the average number of words per document, assuming you do not know how many
documents are in the list (10)
Q4: Obtain the frequency of each word in the RDD above (10)
Q5: Sort by word frequency, choose the top 3 from the RDD above, and print the resulting RDD
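The counting logic behind Q2 to Q5 can be sketched in plain Python so that it runs without a cluster. This is only an illustration of the steps, not the required RDD solution: in PySpark the list would first be loaded with sc.parallelize(doc), and the steps below would roughly correspond to flatMap, count, map plus reduceByKey, and sortBy followed by take. The alphabetical tie-break in the last step is an assumption; the exam does not say how ties should be ordered.

```python
from collections import Counter

docs = [
    "read book and read hard copy",
    "book is hard to read",
    "hard wood is hard to cut",
]

# flatMap-style step: split every document into words and flatten the result.
words = [word for d in docs for word in d.split()]

# Q2: total number of words across all documents.
total_words = len(words)

# Q3: average words per document; the document count is computed, not hard-coded.
avg_words = total_words / len(docs)

# Q4: frequency of each word (the reduceByKey-style aggregation).
freq = Counter(words)

# Q5: sort by descending frequency (ties broken alphabetically, an assumption)
# and keep the top 3.
top3 = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:3]

print(total_words, round(avg_words, 2), top3)
```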
Q6: Do a word count in each document (you want to know each word's frequency in each document)
Q7: Save the following data into a CSV file (with header), so there should be 5 columns and 6
rows (including one header row). Now use sc.textFile() to read the data from the CSV file,
obtaining an RDD by deleting the first row and removing the records with a tax amount less than
name education gender income tax
Q8: Create an RDD with two columns: name and taxrate, where taxrate = tax / income if income > 0,
otherwise 0. Then sort people in ascending order of taxrate (10)
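The taxrate rule in Q8 can be written as a small function that is easy to check on its own. The sketch below uses plain Python and hypothetical (name, income, tax) records, since the exam's actual rows are not reproduced here; with an RDD, the same function could be applied inside map and the result ordered with sortBy.

```python
def tax_rate(income, tax):
    """Return tax / income when income > 0, otherwise 0."""
    return tax / income if income > 0 else 0.0

# Hypothetical records for illustration only; not the exam's data.
people = [
    ("A", 50000.0, 10000.0),
    ("B", 0.0, 0.0),
    ("C", 40000.0, 2000.0),
]

# Build (name, taxrate) pairs, then sort in ascending order of taxrate.
rates = [(name, tax_rate(income, tax)) for name, income, tax in people]
by_rate = sorted(rates, key=lambda pair: pair[1])

print(by_rate)
```

Guarding the division inside the function keeps records with zero income from raising ZeroDivisionError.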
Q9: Calculate the average income for each education group and each gender group, respectively (5)
Q10: Using sc.textFile(), read the data from the CSV file and obtain an RDD by deleting the first row. Now
convert the RDD into a Spark DataFrame, using the header as the field names. Then
print the DataFrame. (10)
Q11: Register the DataFrame above as a table called profile and use Spark SQL to get the
average net income (income minus tax) within each (education, gender) group (10)
Q12: Create a new column 'net' in the DataFrame above. The rule is: if income > tax > 0, then net = income - tax;
if income = tax = 0, then net = 1000; otherwise net = 0. Then calculate max(net)
within each education group (shown as a list, such as [('C', 36000.0), ('H', 89200.0), ('U', 52000.0)]).
Note: you need to use the DataFrame API (rather than Spark SQL) to solve this question (10)
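The three-way rule for 'net' in Q12 is easy to get wrong, so it may help to pin it down as a plain Python function before expressing it with DataFrame column expressions (for example, chained when/otherwise calls from pyspark.sql.functions followed by a groupBy on education). The rows below are hypothetical and are not the exam's data.

```python
def net_income(income, tax):
    """Apply the Q12 rule for the 'net' column."""
    if income > tax > 0:
        return income - tax
    if income == 0 and tax == 0:
        return 1000.0
    return 0.0

# Hypothetical (education, income, tax) rows for illustration only.
rows = [
    ("C", 30000.0, 3000.0),
    ("C", 0.0, 0.0),
    ("H", 80000.0, 8000.0),
    ("U", 500.0, 600.0),
]

# Maximum net per education group, reported as a sorted list of pairs.
max_net = {}
for education, income, tax in rows:
    net = net_income(income, tax)
    max_net[education] = max(max_net.get(education, net), net)

result = sorted(max_net.items())
print(result)
```

Note that a row with tax greater than income, or with exactly one of the two equal to zero, falls through to the final branch and gets net = 0.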
Q13 (Bonus 30, added to the total mark up to 100): The following questions are designed to practice
object-oriented programming (OOP). Here we have defined a class invest:
class invest:
    tot_return = 0.0

    def __init__(self, investprop1 = 0.5, goodinvest = 0.65, seed = 7):
        self.gain_rate1 = None
        self.gain_rate2 = None
        self.loss_rate1 = None
        self.loss_rate2 = None
        self.investprop1 = investprop1
        self.goodinvest = goodinvest
        self.seed = seed
        self.is_loss = None
        self.yearly_return = None
Suppose you invest a certain amount (e.g. $1000) at the beginning of each year
and withdraw the total amount (with all returns) at the end of that year; you then reinvest all of
the fund (i.e. the total balance including any gain or loss since the first year's investment)….
Each year you split the total fund into two investment portfolios based on a proportion value, investprop1 (a class variable), which means you invest investprop1*$1000 in
the first portfolio and (1 - investprop1)*$1000 in the second portfolio.
In addition, every year can be 'good' or 'bad' based on the probability value 'goodinvest' (such
as 30%). If a random draw with probability 'goodinvest' makes the year 'good', i.e. if the
class variable 'is_loss' is 0, then the fund gains at the interest rates
'gain_rate1' and 'gain_rate2' for the two portfolios respectively; otherwise the fund loses
at the loss rates 'loss_rate1' and 'loss_rate2' for the two portfolios respectively.
Please continue writing the Python code to define the following class methods to realize the
def lossprob(self): -> to determine whether 'is_loss' is 0 or 1 and print the result
def getyearly(self): -> to determine the total fund amount for the n-th year's investment, n=1,..
def totalfund(self): -> to report the total fund after the n-th year's investment, n=1,..
Then you can try out the class by generating several instances (objects), passing
values for class features such as 'gain_rate1', 'gain_rate2'…, then carry out the investment and print
the total return each year.
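One possible reading of the Q13 description can be sketched as follows. Several details are assumptions rather than part of the exam: the class name InvestSketch, the deposit parameter (a fresh $1000 joining the carried-over balance each year), storing the balance per instance instead of in a class-level variable, using a per-instance random.Random(seed) generator, and setting the four rates as attributes after construction.

```python
import random


class InvestSketch:
    """Hypothetical completion of the exam's invest class."""

    def __init__(self, investprop1=0.5, goodinvest=0.65, seed=7):
        self.gain_rate1 = None
        self.gain_rate2 = None
        self.loss_rate1 = None
        self.loss_rate2 = None
        self.investprop1 = investprop1
        self.goodinvest = goodinvest
        self.seed = seed
        self.is_loss = None
        self.yearly_return = None
        self.tot_return = 0.0  # assumption: balance is tracked per instance
        self._rng = random.Random(seed)  # assumption: seeded, repeatable draws

    def lossprob(self):
        # A year is "good" (is_loss = 0) with probability goodinvest.
        self.is_loss = 0 if self._rng.random() < self.goodinvest else 1
        print("is_loss =", self.is_loss)
        return self.is_loss

    def getyearly(self, deposit=1000.0):
        # Assumption: the new deposit is added to the carried-over balance.
        fund = self.tot_return + deposit
        part1 = fund * self.investprop1
        part2 = fund - part1
        if self.lossprob() == 0:
            end = part1 * (1 + self.gain_rate1) + part2 * (1 + self.gain_rate2)
        else:
            end = part1 * (1 - self.loss_rate1) + part2 * (1 - self.loss_rate2)
        self.yearly_return = end - fund
        self.tot_return = end
        return end

    def totalfund(self):
        print("total fund:", round(self.tot_return, 2))
        return self.tot_return


# Usage with made-up rates: simulate three years and report the balance each year.
inv = InvestSketch(investprop1=0.6, goodinvest=0.65, seed=7)
inv.gain_rate1, inv.gain_rate2 = 0.10, 0.05
inv.loss_rate1, inv.loss_rate2 = 0.08, 0.03
for year in range(1, 4):
    inv.getyearly()
    inv.totalfund()
```

Setting goodinvest to 1.0 or 0.0 forces every year to be good or bad respectively, which makes the arithmetic easy to verify by hand.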