# Expert Answer: Big Data with Spark, Creating an RDD

Final Exam of Big Data
Email:
Name:
For Q1 to Q9, you must use generic RDDs rather than Spark DataFrames or Spark SQL.

Q1: Create an RDD containing 3 documents using the following list. (5)
doc = ['read book and read hard copy', 'book is hard to read', 'hard wood is hard to cut']
Then print the collected data from this RDD.

Q2: Count the total number of words in the RDD above. (5)

Q3: Obtain the average number of words per document, assuming you do not know how many documents are in the list. (10)

Q4: Obtain the frequency of each word in the RDD above. (10)

Q5: Sort by word frequency, choose the top 3 from the RDD above, and print the resulting RDD. (5)

Q6: Do a word count within each document (that is, find each word's frequency in each document). (15)
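
The quantities asked for in Q2 to Q6 can be cross-checked without a cluster. The following is a plain-Python reference computation, not a Spark solution: it only produces the numbers an RDD-based answer should reproduce. The variable names are illustrative, and the alphabetical tie-break in the top 3 is an assumption, since the question does not say how ties are resolved.

```python
from collections import Counter

docs = ['read book and read hard copy',
        'book is hard to read',
        'hard wood is hard to cut']

# Q2: flatten every document into words and count them.
words = [w for d in docs for w in d.split()]
total_words = len(words)

# Q3: accumulate (word count, document count) together, so the number
# of documents is never assumed in advance.
word_sum, doc_sum = 0, 0
for d in docs:
    word_sum += len(d.split())
    doc_sum += 1
avg_words = word_sum / doc_sum

# Q4: frequency of each word over all documents.
freq = Counter(words)

# Q5: top 3 by frequency (assumption: ties broken alphabetically).
top3 = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:3]

# Q6: frequency of each word within each document, keyed by (doc index, word).
per_doc = Counter((i, w) for i, d in enumerate(docs) for w in d.split())

print(total_words, avg_words)
print(top3)
print(sorted(per_doc.items()))
```

In an RDD answer the same steps would typically map onto `flatMap`, `map`, `reduceByKey` and `sortBy`, with the composite `(doc index, word)` key doing the work for Q6.
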
Q7: Save the following data into a CSV file (with header), so that the file has 5 columns and 6 rows (including one header row). Then use sc.textFile() to read the data from the CSV file, and obtain an RDD by deleting the first row and removing the records with a tax amount less than 1000. (5)
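
| name | education | gender | income | tax |
|------|-----------|--------|--------|------|
| John | U | M | 60000 | 8000 |
| Jane | H | F | 90000 | 800 |
| Mike | C | M | 40000 | 4000 |
| Marry | C | F | 0 | 1200 |
| John | U | M | 40000 | 9000 |
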
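
The file-writing half of Q7 needs no Spark at all. The sketch below assumes the flattened values pair up as John/U/M/60000/8000, Jane/H/F/90000/800, Mike/C/M/40000/4000, Marry/C/F/0/1200 and John/U/M/40000/9000 (a pairing that reproduces the example list quoted in Q12), and it uses a hypothetical file name `profile.csv` in a temporary directory. The read-back is done in plain Python to show the header-drop and tax filter; it is a stand-in for, not an instance of, the `sc.textFile()` step.

```python
import csv
import os
import tempfile

header = ["name", "education", "gender", "income", "tax"]
rows = [
    ["John", "U", "M", 60000, 8000],
    ["Jane", "H", "F", 90000, 800],
    ["Mike", "C", "M", 40000, 4000],
    ["Marry", "C", "F", 0, 1200],
    ["John", "U", "M", 40000, 9000],
]

# Write the header plus five data rows (file name is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "profile.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# Read the file back as raw text lines, as a line-oriented reader would see it.
with open(path) as f:
    lines = f.read().splitlines()

# Drop the header line, split each record, and keep tax >= 1000.
first = lines[0]
records = [line.split(",") for line in lines if line != first]
kept = [r for r in records if int(r[4]) >= 1000]

print(len(lines), "lines written")
print(kept)
```

With an RDD of lines, the same two predicates (`line != first` and `int(r[4]) >= 1000`) could be passed to `filter`.
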
Q8: Create an RDD with two columns: name and taxrate, where taxrate = tax / income if income > 0, otherwise 0. Then sort people in ascending order of taxrate. (10)

Q9: Calculate the average income for each education group and for each gender group, respectively. (5)
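
As a cross-check for Q8 and Q9, here is a plain-Python reference computation (again not a Spark solution). It assumes the five records pair up as in the tuples below and that all five rows are used rather than only those surviving the Q7 filter; the question leaves that choice open. The helper `average_income` is a hypothetical name.

```python
from collections import defaultdict

# (name, education, gender, income, tax); the pairing of values is an assumption.
rows = [
    ("John", "U", "M", 60000, 8000),
    ("Jane", "H", "F", 90000, 800),
    ("Mike", "C", "M", 40000, 4000),
    ("Marry", "C", "F", 0, 1200),
    ("John", "U", "M", 40000, 9000),
]

# Q8: taxrate = tax / income when income > 0, else 0; sort ascending.
rates = [(name, tax / income if income > 0 else 0.0)
         for name, _, _, income, tax in rows]
ranked = sorted(rates, key=lambda pair: pair[1])


def average_income(key_index):
    """Average income per group, accumulating (sum, count) for each key."""
    sums = defaultdict(lambda: [0, 0])
    for row in rows:
        acc = sums[row[key_index]]
        acc[0] += row[3]
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# Q9: group by education (column 1) and by gender (column 2).
avg_by_education = average_income(1)
avg_by_gender = average_income(2)

print(ranked)
print(avg_by_education)
print(avg_by_gender)
```

The `(sum, count)` accumulator mirrors the usual RDD pattern of mapping each record to `(key, (income, 1))` and combining pairs with `reduceByKey` before dividing.
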
Q10: Using sc.textFile(), read the data from the CSV file and obtain an RDD by deleting the first row. Then convert the RDD into a Spark DataFrame using the header as the field names, and print the DataFrame. (10)

Q11: Register the DataFrame above as a table called profile and use Spark SQL to get the average net income (income minus tax) within each (education, gender) group. (10)

Q12: Create a new column 'net' in the DataFrame above. The rule is: if income > tax > 0, then net = income - tax; if income = tax = 0, then net = 1000; otherwise net = 0. Then calculate max(net) within each education group (shown as a list, such as [('C', 36000.0), ('H', 89200.0), ('U', 52000.0)]). Note: you need to use the DataFrame API (rather than Spark SQL) to solve this question. (10)
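
The 'net' rule in Q12 and the grouped average in Q11 can be checked by hand before writing any DataFrame code. The sketch below is plain Python, not the DataFrame API the question requires; it assumes the five records pair up as in the tuples shown, and the function name `net` is illustrative. Under that assumption it reproduces the example list quoted in Q12.

```python
from collections import defaultdict

# (name, education, gender, income, tax); the pairing of values is an assumption.
rows = [
    ("John", "U", "M", 60000, 8000),
    ("Jane", "H", "F", 90000, 800),
    ("Mike", "C", "M", 40000, 4000),
    ("Marry", "C", "F", 0, 1200),
    ("John", "U", "M", 40000, 9000),
]


def net(income, tax):
    """Q12 rule: income - tax if income > tax > 0; 1000 if both are 0; else 0."""
    if income > tax > 0:
        return float(income - tax)
    if income == 0 and tax == 0:
        return 1000.0
    return 0.0


# Q12: maximum net within each education group, sorted by group label.
max_net = {}
for _, education, _, income, tax in rows:
    max_net[education] = max(max_net.get(education, float("-inf")),
                             net(income, tax))
result = sorted(max_net.items())

# Q11: average of (income - tax) within each (education, gender) group.
groups = defaultdict(list)
for _, education, gender, income, tax in rows:
    groups[(education, gender)].append(income - tax)
avg_net = {key: sum(vals) / len(vals) for key, vals in groups.items()}

print(result)
print(avg_net)
```

Note that Q11 uses the raw difference income - tax, so a group with zero income and positive tax has a negative average, whereas the Q12 rule maps the same record to 0.
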
Q13 (Bonus 30, added to the total mark up to 100): The following questions are designed to practice object-oriented programming (OOP). Here we have defined a class invest:

    import random

    class invest():
        tot_return = 0.0

        def __init__(self, investprop1 = 0.5, goodinvest = 0.65, seed = 7):
            self.gain_rate1 = None
            self.gain_rate2 = None
            self.loss_rate1 = None
            self.loss_rate2 = None
            self.investprop1 = investprop1
            self.goodinvest = goodinvest
            self.seed = seed
            self.is_loss = None
            self.yearly_return = None

Suppose you invest a certain amount (e.g. \$1000) at the beginning of each year and withdraw the total amount (with all returns) at the end of that year; you then reinvest all of the fund (i.e. the total balance, including any gain or loss since the first year's investment)….

Each year you split the total fund into two investment portfolios based on a proportion value, investprop1 (an attribute of the class), which means you invest investprop1 * \$1000 in the first portfolio and (1 - investprop1) * \$1000 in the second portfolio.

In addition, every year can be 'good' or 'bad' based on the probability value 'goodinvest' (such as 30%). If a random draw with probability 'goodinvest' makes the year 'good', i.e. if the attribute 'is_loss' is 0, then the fund gains according to the interest rates 'gain_rate1' and 'gain_rate2' for the two portfolios respectively; otherwise the fund suffers a loss according to the loss rates 'loss_rate1' and 'loss_rate2' for the two portfolios respectively.

Please continue the Python code by defining the following class methods to implement the functionality above:
def lossprob(self): -> determine whether 'is_loss' is 0 or 1 and print the result
def getyearly(self): -> determine the total fund amount in the n-th year of investment, n=1,..
def totalfund(self): -> report the total fund after the n-th year of investment, n=1,..

Then you can try out the class by generating several instances (objects), passing the values of class features such as 'gain_rate1', 'gain_rate2'…, and then run the investment and print the total return each year.
